Regular expression for parsing HTML

Can anyone help me get a generic regular expression for my HTML file?

My HTML is:

<tr class="evenline"><td width="40%" align="right">ModelPerReq</td><td width="60%" align="left"><b>Mand M2CU 10</b></td></tr>
<tr class="evenline"><td width="40%" align="right">Serial_1 num</td><td width="60%" align="left"><b>GFRTF%65*jh</b></td></tr>
<tr class="evenline"><td width="40%" align="right">Host name</td><td width="60%" align="left"><b>&lt;not known&gt;</b></td></tr>
 
Here I need to extract only the text marked in red.
For this I've written an RE like: >[\w\n ]*<
but it does not always work well.
Can you please give me a generic RE?
Thanks in advance.




  • Amde

    There are many articles on the net about parsing HTML/XML tags. One recurring problem is nesting: the inner text of a tag may itself contain more tags. I will show you a way to extract data from an individual tag, whether it has attributes or not. If you run this on a file, you will need to drill down into each match with the same expression until it no longer matches; at that point you have plain text.

    The regex uses named capture groups so you can easily extract what you want: tag, attributes, or text. You are interested in the text, so access it via a match:

    match.Groups("text").Value


    RegEx for HTML Tags

    Imports System.Text.RegularExpressions

    ' Regular expression built for Visual Basic on: Thu, Sep 28, 2006, 01:42:30 PM
    ' Using Expresso Version: 2.1.2150, http://www.ultrapico.com
    '
    ' A description of the regular expression:
    '
    ' <
    ' [tag]: A named capture group. [\w*]
    ' Alphanumeric, any number of repetitions
    ' [attributes]: A named capture group. [[\w"'=%\s]*], zero or one repetitions
    ' Any character in this class: [\w"'=%\s], any number of repetitions
    ' >
    ' [text]: A named capture group. [.*]
    ' Any character, any number of repetitions
    ' </\k<tag>>
    ' </
    ' Backreference to capture named: tag
    '
    '

    Public Dim MyRegex As Regex = New Regex( _
    "<(?<tag>\w*)(?<attributes>[\w""'=%\s]*)?>(?<text>.*)</\k<tag>>", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.Singleline _
    Or RegexOptions.CultureInvariant _
    Or RegexOptions.IgnorePatternWhitespace _
    Or RegexOptions.Compiled _
    )
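
The drill-down described above can be sketched as follows. This is only an illustration, written in Python rather than the thread's Visual Basic, so the named groups are spelled `(?P<name>...)` and the backreference `(?P=tag)`. It makes two assumptions that are not part of the posted pattern: the text group uses a lazy `.*?` instead of the greedy `.*`, so that sibling rows and cells are matched separately, and HTML entities such as `&lt;` are decoded in the final text. The names `innermost_texts` and `SAMPLE_ROWS` are hypothetical.

```python
import html
import re

# Python spelling of the posted pattern. Assumption of this sketch: the text
# group is lazy (.*?) rather than greedy, so each sibling tag matches alone.
TAG_RE = re.compile(
    r"""<(?P<tag>\w*)(?P<attributes>[\w"'=%\s]*)?>(?P<text>.*?)</(?P=tag)>""",
    re.IGNORECASE | re.DOTALL,
)

# The three rows from the question.
SAMPLE_ROWS = """\
<tr class="evenline"><td width="40%" align="right">ModelPerReq</td><td width="60%" align="left"><b>Mand M2CU 10</b></td></tr>
<tr class="evenline"><td width="40%" align="right">Serial_1 num</td><td width="60%" align="left"><b>GFRTF%65*jh</b></td></tr>
<tr class="evenline"><td width="40%" align="right">Host name</td><td width="60%" align="left"><b>&lt;not known&gt;</b></td></tr>
"""


def innermost_texts(fragment):
    """Drill into the 'text' group of every match until nothing matches.

    Limitation: once a fragment contains at least one tag pair, any loose
    text outside those pairs is ignored.
    """
    matches = list(TAG_RE.finditer(fragment))
    if not matches:
        # No tag pair left: this fragment is plain text (possibly empty).
        stripped = fragment.strip()
        return [html.unescape(stripped)] if stripped else []
    results = []
    for match in matches:
        results.extend(innermost_texts(match.group("text")))
    return results


if __name__ == "__main__":
    for text in innermost_texts(SAMPLE_ROWS):
        print(text)
```

On the rows from the question this yields the label and value of each row in document order. The lazy quantifier stops at the first closing tag of the same name, so a tag nested inside another tag of the same name (a table inside a table cell, for example) would be cut short. For input like that, an HTML parser is a safer choice than a regular expression.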

