How to NOT match text in a tag

Hi all,

Here is a tricky regular expression question. I want to match a word, say "red", in the text body only, NOT inside tags. For example, in "<font color="red">this is red, test red 123</font>" I only want to match the "red" occurrences in "this is red, test red 123", not the one in "<font color="red">". I also cannot remove the tags before or after the match.

Thanks,

Ning




  • Kevinmac

    I think it's an application-dependent issue. In my experience (data and text processing), negative look-arounds are more robust for such tasks if you are dealing with HTML that is not 100% well-formed. It's pure statistics. If the HTML is well-formed, then yes, you are right, the positive look-behind should work the same.
  • Bill Brennan

    Daniel,

    two more points:

    1. your pattern

    Regex regex = new Regex(@"(?<=\>[^\<]+)red");

    will fail to match the first red in

    <font color=""red"">red, test red 123</font>

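    because you used the "+" quantifier instead of "*": the first red comes directly after ">", so there is no character left for [^\<]+ to consume.

    2. You don't have to escape ">" and "<" when the characters are inside a character class: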

    Regex regex = new Regex(@"(?<=\>[^<]+)red");
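
    The difference between the two quantifiers can be checked with a small sketch. It assumes C# 9 or later (top-level statements), and MatchIndexes is a hypothetical helper introduced only for this illustration. The "*" variant is the fix implied by point 1, not a pattern posted verbatim in the thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: start index of every match of `pattern` in `input`.
int[] MatchIndexes(string pattern, string input) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Index).ToArray();

string sample = @"<font color=""red"">red, test red 123</font>";

// "+" requires at least one character between ">" and "red".
Console.WriteLine("+ : " + string.Join(", ", MatchIndexes(@"(?<=>[^<]+)red", sample)));
// "*" also accepts "red" placed directly after ">".
Console.WriteLine("* : " + string.Join(", ", MatchIndexes(@"(?<=>[^<]*)red", sample)));
```

    With "+", only the second red in the text (index 28) is reported. With "*", both text occurrences (18 and 28) are found, and the attribute value is skipped in both cases.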


  • ANS-Denver

    Thanks guys,

    I have sorted this out myself:

    "red(?:[^>]*(<|$))"

    Thanks,

    Ning


  • Daniel Karanov

    No, Daniel, you cannot [always] use the positive look-behind to find those entries. For example, your logic will fail to match the first occurrence of RED in the following input:

    RED <font color="red">this is red, test red 123</font>

    For tasks like this you need to use a negative look-ahead OR look-behind, as I suggested earlier. It is more robust, especially when you are dealing with random chunks of HTML code.
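
    A minimal sketch of that comparison, assuming C# 9 or later (top-level statements) and case-insensitive matching so that "RED" counts as an occurrence. MatchIndexes is a hypothetical helper, and the two patterns are the "*" form of the look-behind and the negative look-ahead discussed in this thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: start index of every case-insensitive match.
int[] MatchIndexes(string pattern, string input) =>
    Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
         .Cast<Match>().Select(m => m.Index).ToArray();

string chunk = @"RED <font color=""red"">this is red, test red 123</font>";

// Positive look-behind: needs a ">" before the word with no "<" in between.
int[] behind = MatchIndexes(@"(?<=>[^<]*)red", chunk);
// Negative look-ahead: rejects a word only when the next angle bracket after it is ">".
int[] ahead = MatchIndexes(@"red(?![^<>]*>)", chunk);

Console.WriteLine("look-behind: " + string.Join(", ", behind)); // 30, 40
Console.WriteLine("look-ahead:  " + string.Join(", ", ahead));  // 0, 30, 40
```

    The look-behind misses the leading RED at index 0 because no ">" precedes it, while the negative look-ahead reports it along with the two occurrences inside the element.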


  • Marcin_Zawadzki

    Sergei,

    You're right, it must be a '*', not a '+'. But I still think that the positive look-behind is a good choice. HTML will always begin with '<?xml', '<!DOCTYPE' or '<html', so you'll always have a tag at the beginning that ends with '>'.

    --
    Regards,
    Daniel Kuppitz


  • Mike Batton

    try

    red(?![^<>]*>)

    with the Singleline option (RegexOptions.Singleline) ON
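
    As a usage sketch, the same pattern can drive a replacement that touches only text outside tags. It assumes C# 9 or later (top-level statements). Highlight is a hypothetical helper, and wrapping matches in <b> is just an illustrative choice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps each occurrence of `word` outside tags in <b>...</b>.
// "$&" in the replacement string stands for the whole match.
string Highlight(string html, string word) =>
    Regex.Replace(html, Regex.Escape(word) + @"(?![^<>]*>)", "<b>$&</b>");

// Line breaks inside the tag and inside the text are both handled,
// because the negated class [^<>] also matches newline characters.
string page = "<font\ncolor=\"red\">this is red,\ntest red 123</font>";
Console.WriteLine(Highlight(page, "red"));
```

    Since the pattern contains no ".", it should behave the same with or without RegexOptions.Singleline. One caveat: a literal ">" in the text (for example "red > blue") makes the preceding word look as if it were inside a tag, so that occurrence is skipped.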


  • srfitz2000

    Hi Ning,

    using System;
    using System.Text.RegularExpressions;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                string test = @"<font color=""red"">this is red, test red 123</font>";

                // Match "red" only when a ">" precedes it, separated by one or more non-"<" characters.
                Regex regex = new Regex(@"(?<=\>[^\<]+)red");
                MatchCollection matches = regex.Matches(test);

                // Print the position and text of every match.
                foreach (Match m in matches)
                {
                    Console.Write("Position/Index {0}: ", m.Index);
                    Console.WriteLine(test.Substring(m.Index, m.Length));
                }
            }
        }
    }


    This will even work with line breaks in text and/or tags.

    --
    Regards,
    Daniel Kuppitz

