How to find the last occurrence of <p> within a text block titled <h2>?

I have html files with the following structure:

<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
<p>paragraph 4 - bladiblah di blahblah</p>
<p>paragraph 5 - bladiblah di blahblah</p>

<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>

<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>

From each block of text headed by a <h2>, I need to mark the last <p> with some specific markup.

How can I locate the last <p> in each text block using regular expressions?

Any suggestions welcome,
Zjivago





  • AlanKohl

    Interesting suggestion, Ashish, if it were not that I'm trying to do this using regular expressions.
    I'm using regexes already on these files, and so far, I've managed to accomplish all tasks using regexes.
    The best I've come up with so far is:

    (<p>)([^<>]*)(</p>)(\s)*(?=<h2>)

    This pattern looks for the string "<p>", followed by zero to unlimited characters other than "<" and ">", followed by the string "</p>", followed by zero to unlimited whitespace characters (spaces, tabs, linebreaks...), followed by the string "<h2>". The last is declared as a positive lookahead.
    This pattern actually selects the <p> before the next <h2>. Unfortunately, the pattern breaks when the <p> paragraph contains HTML markup, such as <u>...</u>, which fails to satisfy the 2nd term.
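For anyone who wants to reproduce this behaviour, here is a minimal sketch in Python rather than .NET, with the lookahead spelled out as (?=<h2>). The sample text and the names LAST_P, html and plain_hits are made up for illustration. It suggests two limits: a paragraph containing <u> is skipped, and so is the last paragraph of the final block, because no <h2> follows it.

```python
import re

# Pattern from the post, with the positive lookahead written as (?=<h2>).
LAST_P = re.compile(r"(<p>)([^<>]*)(</p>)(\s)*(?=<h2>)")

# Hypothetical sample: block B ends with nested markup, block C ends the file.
html = (
    "<h2>title A</h2>\n"
    "<p>a1</p>\n"
    "<p>a2</p>\n"
    "\n"
    "<h2>title B</h2>\n"
    "<p>b1</p>\n"
    "<p>b2 <u>underlined</u></p>\n"
    "\n"
    "<h2>title C</h2>\n"
    "<p>c1</p>\n"
)

# Group 2 holds the paragraph text of every match.
plain_hits = [m.group(2) for m in LAST_P.finditer(html)]
print(plain_hits)  # ['a2']
```

Only block A's last paragraph is found here: block B fails on the <u> tag, and block C fails because the lookahead insists on a following <h2>.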


  • moondaddy

    Within each block, you could get the last index of "<p>" and then use that index to work on the string/text.
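A rough sketch of how the last-index idea could look without any regex, written in Python, where str.rfind plays the role of .NET's String.LastIndexOf. The function name last_p_offsets and the sample text are hypothetical, and the sketch assumes every block starts with a literal "<h2>".

```python
def last_p_offsets(html):
    """Return the offset of the last "<p>" inside each <h2>-headed block."""
    # Collect the start offset of every "<h2>".
    starts = []
    pos = html.find("<h2>")
    while pos != -1:
        starts.append(pos)
        pos = html.find("<h2>", pos + 1)

    # Each block runs from its "<h2>" to the next "<h2>" (or the end of the text).
    ends = starts[1:] + [len(html)]

    offsets = []
    for begin, end in zip(starts, ends):
        last = html.rfind("<p>", begin, end)  # -1 when the block has no <p>
        if last != -1:
            offsets.append(last)
    return offsets


sample = "<h2>A</h2>\n<p>a1</p>\n<p>a2</p>\n\n<h2>B</h2>\n<p>b1</p>\n"
found = [sample[i:i + 5] for i in last_p_offsets(sample)]
print(found)  # ['<p>a2', '<p>b1']
```

The returned offsets can then be used to splice markup into the string, working from the last offset backwards so earlier offsets stay valid.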



  • jimgong

    Looks pretty good, Sergei Z. Great work!

    To what extent does the term (?:<string>) differ from (?=<string>) or (?!<string>)?


  • Bo_

    You wrote: ***To what extent does the term (?:<string>) differ from (?=<string>) or (?!<string>)?***

    You are mixing two different types of entities here:

    1. look-arounds: (?= ), (?! ), (?<= ), (?<! ), AND

    2. non-capturing groups: (?: )

    They have nothing in common. You might want to read the MSDN docs for the .NET Regex object to see exactly what they are about. They have nice code snippets too.
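To make the distinction concrete, here is a small sketch using Python's re module, which uses the same syntax for these constructs as .NET. The sample string and variable names are invented for the example. A non-capturing group still consumes text, while a lookahead only checks what follows.

```python
import re

text = "<p>x</p><h2>"

# Non-capturing group: "<h2>" is consumed and is part of the match;
# it simply does not get a group number.
consumed = re.search(r"</p>(?:<h2>)", text)

# Positive lookahead: "<h2>" must follow, but is not part of the match.
peeked = re.search(r"</p>(?=<h2>)", text)

# Negative lookahead: succeeds only where "<h2>" does NOT follow.
rejected = re.search(r"</p>(?!<h2>)", text)

print(consumed.group(0))  # </p><h2>
print(peeked.group(0))    # </p>
print(rejected)           # None
```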


  • samantha chandrasekar

    ok got it finally:

    use:

    <p>(<u>[^<>]*</u>|[^<>])*</p>(?=\s*(<h2>|\Z))

    w/ SingleLine ON

    it'll pick up the last <p> of each block (bolded in the original post) from the input:

    <h2>article title</h2>
    <p>paragraph 1 - bladiblah di blahblah</p>
    <p>paragraph 2 - bladiblah di blahblah</p>
    <p>paragraph 3 - bladiblah di blahblah</p>
    <p>paragraph 4 - bladiblah di blahblah</p>
    <p>
    <u>paragraph 5 - bladiblah di blahblah
    </u>
    </p>

    <h2>article title</h2>
    <p>paragraph 1 - bladiblah di blahblah</p>
    <p>paragraph 2 - bladiblah di blahblah</p>

    <h2>article title</h2>
    <p>paragraph 1 - bladiblah di blahblah</p>
    <p>paragraph 2 - bladiblah di blahblah</p>
    <p>paragraph 3 - bladiblah di blahblah</p>


  • dczraptor

    Will the String.LastIndexOf method help you?



  • Unknown Name

    sorry, it does not really work, let me think about it...


  • DawnJ

    Not bad, Sergei Z, not bad. Unfortunately, when the text in a paragraph runs over multiple lines, the last paragraph does not get selected.
    Therefore, I'll come up with a better representation of the actual data:

    <h2>article title 1</h2>
    <p>paragraph 1 - bladiblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 2 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>

    <h2>article title 2</h2>
    <p>paragraph 1 - bladiblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 2 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 3 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 4 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>

    <h2>article title 3</h2>
    <p>paragraph 1 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 2 - bladiblah di blahblah.</p>
    <p>paragraph 3 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>



    Could you tell me what the "(?=\s*(?:<h2>|\Z))" part does?
    As far as I understand this pattern, it says: look ahead for zero or unlimited whitespace characters... oops... Don't know what the "(?:<h2>)" term does...
    Could you explain this, Sergei Z? Thanks!



  • Harinarayan

    i tried this

    <p>.*?</p>(?=\s*(?:<h2>|\Z))

    w/ SingleLine ON and matched all 3 occurrences of <p> tags from your original text:

    <h2>article title</h2>
    <p>paragraph 1 - bladiblah di blahblah</p>
    <p>paragraph 2 - bladiblah di blahblah</p>
    <p>paragraph 3 - bladiblah di blahblah</p>
    <p>paragraph 4 - bladiblah di blahblah</p>
    <p>paragraph 5 - bladiblah di blahblah</p>

    <h2>article title</h2>
    <p>paragraph 1 - bladiblah di blahblah</p>
    <p>paragraph 2 - bladiblah di blahblah</p>

    <h2>article title</h2>
    <p>paragraph 1 - bladiblah di blahblah</p>
    <p>paragraph 2 - bladiblah di blahblah</p>
    <p>paragraph 3 - bladiblah di blahblah</p>

    so how are <u> tags interfering here? can u send the text?
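One thing worth checking with a lazy ".*?" variant is how much text each match spans. A small sketch in Python, where re.DOTALL stands in for SingleLine and the sample text is made up, suggests that there is one match per block, but each match starts at the block's first <p> rather than its last:

```python
import re

# Lazy dot pattern with the lookahead written as (?=...); DOTALL lets "." cross newlines.
LAZY = re.compile(r"<p>.*?</p>(?=\s*(?:<h2>|\Z))", re.DOTALL)

sample = (
    "<h2>A</h2>\n<p>a1</p>\n<p>a2</p>\n\n"
    "<h2>B</h2>\n<p>b1</p>\n<p>b2</p>\n"
)

spans = [m.group(0) for m in LAZY.finditer(sample)]
for span in spans:
    print(repr(span))
# '<p>a1</p>\n<p>a2</p>'
# '<p>b1</p>\n<p>b2</p>'
```

The lazy quantifier only stops at a "</p>" that satisfies the lookahead, so it keeps growing past the earlier closing tags.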


  • NoEgo

     Zjivago wrote:
    Don't know what the "(?:<h2>)"


    It's a non-capturing group, which basically says: match the item, but don't place it in the list of captures. What it allows one to do is

    abcdef

    (?:ab)(?<WhatIWant>\w{4})

    and the only capture that comes back is in the WhatIWant group

    cdef

    If one turns on ExplicitCapture, then unnamed groups are not captured, without having to write (?:xxx). With ExplicitCapture turned on, the above example could be changed to

    (ab)(?<WhatIWant>\w{4})

    It is very useful for weeding out items that are not needed as captures while still keeping the full match. Check out the documentation on Grouping Constructs.
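The same example can be tried in Python as a quick sketch. Note two differences from .NET that this sketch assumes: Python's re module spells a named group (?P<name>...), and it has no ExplicitCapture option, so (?:...) has to be written out.

```python
import re

# (?:ab) is matched but not captured; WhatIWant is the only capture group.
m = re.search(r"(?:ab)(?P<WhatIWant>\w{4})", "abcdef")

print(m.group(0))            # abcdef  (the full match still includes "ab")
print(m.group("WhatIWant"))  # cdef
print(m.groups())            # ('cdef',)
```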




  • cap1000

    just ran the pattern

    <p>(?:<u>[^<>]*</u>|[^<>])*</p>(?=\s*(?:<h2>|\Z))

    in Expresso (.NET regex engine) vs your latest input (from which u should have started the thread, btw). Got matches according to your spec (the last <p> of each block; bolded in the original post):

    <h2>article title 1</h2>
    <p>paragraph 1 - bladiblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 2 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>

    <h2>article title 2</h2>
    <p>paragraph 1 - bladiblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 2 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 3 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 4 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>

    <h2>article title 3</h2>
    <p>paragraph 1 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>
    <p>paragraph 2 - bladiblah di blahblah.</p>
    <p>paragraph 3 - bladiblah di blahblah
    di blahblah. bladiblah di blahblah
    di blahblah. bladiblah di blahblah.</p>

    (?: ) simply instructs the engine not to bother with capturing the text into a group: it's called a non-capturing group. Speeds up processing.
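As a sketch of the final step (actually marking the matched paragraphs), the pattern can be fed to a substitution. This is Python rather than .NET, and the class="last" attribute, the function name mark_last_paragraphs and the sample text are placeholders, since the thread never says which markup is wanted. The pattern contains no ".", so no DOTALL/SingleLine flag is needed here; the negated class [^<>] already matches line breaks.

```python
import re

# Last <p>...</p> before the next <h2> or the end of the text;
# a <u>...</u> pair is tolerated inside the paragraph.
LAST_P = re.compile(r"<p>(?:<u>[^<>]*</u>|[^<>])*</p>(?=\s*(?:<h2>|\Z))")


def mark_last_paragraphs(html, opening='<p class="last">'):
    """Replace the opening tag of each block's last paragraph (hypothetical markup)."""
    return LAST_P.sub(lambda m: opening + m.group(0)[len("<p>"):], html)


sample = (
    "<h2>one</h2>\n"
    "<p>first\nline</p>\n"
    "<p>\n<u>second\n</u>\n</p>\n"
    "\n"
    "<h2>two</h2>\n"
    "<p>only</p>\n"
)

marked = mark_last_paragraphs(sample)
print(marked)
```

Paragraphs containing tags other than <u> would still be skipped; the alternation would need one more branch per tag to cover them.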


  • Jeff Irish

    the pattern should take care of <u> tags in case u have them inside <p> ones


  • JotaC

    Ashish,

    could you please stop spitting out one-liners and tell me exactly how u r going to find the LAST occurrence of <p> within a text block <h2>? I'm still curious. Thanks.

    Sergei Z


  • LastBoyScout

    Ashish,

    pls explain in detail how String.LastIndexOf can help in the situation. I'm very curious. Thanks.

