I have html files with the following structure:
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
<p>paragraph 4 - bladiblah di blahblah</p>
<p>paragraph 5 - bladiblah di blahblah</p>
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
From each block of text headed by an <h2>, I need to mark the last <p> with some specific markup.
How can I locate the last <p> in each text block using regular expressions?
Any suggestions welcome,
Zjivago

How to find the last occurrence of <p> within a text block titled <h2>?
AlanKohl
I'm using regexes already on these files, and so far, I've managed to accomplish all tasks using regexes.
The best I've come up with so far is:
(<p>)([^<>]*)(</p>)(\s)*(?=<h2>)
This pattern looks for the string "<p>", followed by zero or more characters other than "<" and ">", followed by the string "</p>", followed by zero or more whitespace characters (spaces, tabs, line breaks...), followed by the string "<h2>". The last term is declared as a positive lookahead.
This pattern does select the <p> before the next <h2>. Unfortunately, the pattern breaks when the <p> paragraph contains HTML markup such as <u>...</u>, because the second term, ([^<>]*), no longer matches.
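To make the failure described above concrete, here is a minimal sketch in Python rather than .NET (an assumption on my part; the thread itself uses the .NET engine, but these constructs behave the same way in Python's `re` module). The sample strings are made up for illustration, and the lookahead is written out as `(?=<h2>)`.

```python
import re

# The pattern from the post, with the lookahead spelled (?=<h2>).
PATTERN = re.compile(r"(<p>)([^<>]*)(</p>)(\s)*(?=<h2>)")

# Hypothetical sample data: one version with plain paragraphs,
# one where the last paragraph of the first block contains <u> markup.
plain = "<h2>a</h2>\n<p>one</p>\n<p>two</p>\n<h2>b</h2>\n<p>three</p>\n"
nested = "<h2>a</h2>\n<p>one</p>\n<p><u>two</u></p>\n<h2>b</h2>\n<p>three</p>\n"

# Group 2 holds the paragraph text between <p> and </p>.
plain_hits = [m.group(2) for m in PATTERN.finditer(plain)]
nested_hits = [m.group(2) for m in PATTERN.finditer(nested)]

print(plain_hits)   # the paragraph right before the second <h2>
print(nested_hits)  # nothing: [^<>]* cannot cross the <u> tag
```

Note that in both cases the final paragraph of the file is never matched, because no <h2> follows it. The pattern would need an end-of-input alternative in the lookahead to cover that case.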
moondaddy
If you get the last index of "<p>" within a block, you can use that index to work on the string/text.
jimgong
To what extent does the term (?:<string>) differ from (?=<string>) or (?!<string>)?
Bo_
You wrote: ***To what extent does the term (?:<string>) differ from (?=<string>) or (?!<string>)?***
You are mixing two different types of entities here:
1. look-arounds: (?= ) (?! ) (?<= ) (?<! ), AND
2. non-capturing groups: (?: )
They have nothing in common. You might want to read the MSDN docs for the .NET Regex object to see what they are about exactly. They have nice code snippets too.
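A small sketch of the difference, written in Python as an assumption (the thread is about .NET, but look-arounds and non-capturing groups use the same syntax in both engines). The test string is hypothetical.

```python
import re

text = "<p>x</p><h2>"

# Lookahead (?=...): asserts that <h2> follows, but leaves it out of the match.
lookahead = re.search(r"</p>(?=<h2>)", text)

# Non-capturing group (?:...): <h2> is consumed and becomes part of the match;
# it simply does not get a group number.
non_capturing = re.search(r"</p>(?:<h2>)", text)

# Negative lookahead (?!...): succeeds only where <h2> does NOT follow.
negative = re.search(r"</p>(?!<h2>)", text)

print(lookahead.group(0))
print(non_capturing.group(0), non_capturing.groups())
print(negative)
```

In short: a look-around tests the surrounding text without consuming it, while a non-capturing group consumes text like any other group and only skips the capture.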
samantha chandrasekar
ok got it finally:
use:
<p>(<u>[^<>]*</u>|[^<>])*</p>(?=\s*(<h2>|\Z))
w/ SingleLine ON
it'll pick up the last <p> of each block (shown in bold in the original post) from the input:
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
<p>paragraph 4 - bladiblah di blahblah</p>
<p>
<u>paragraph 5 - bladiblah di blahblah
</u>
</p>
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
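For readers who want to try this outside .NET, here is a hedged sketch in Python (my choice of language, not the poster's). It uses the pattern from this post with the lookahead written as `(?=...)`, applied to a shortened, made-up version of the input. Since the pattern contains no `.`, the Python flag corresponding to SingleLine (`re.DOTALL`) should make no difference and is left off.

```python
import re

# Pattern from the post: a <p> whose content is either plain text or
# <u>...</u> runs, followed (after optional whitespace) by <h2> or end of input.
PATTERN = re.compile(r"<p>(<u>[^<>]*</u>|[^<>])*</p>(?=\s*(<h2>|\Z))")

# Shortened, hypothetical input in the same shape as the sample above.
html = (
    "<h2>article title</h2>\n"
    "<p>paragraph 1</p>\n"
    "<p>\n<u>paragraph 2\n</u>\n</p>\n"
    "<h2>article title</h2>\n"
    "<p>paragraph 1</p>\n"
    "<p>paragraph 2</p>\n"
)

# group(0) is the whole matched paragraph, tags included.
last_paragraphs = [m.group(0) for m in PATTERN.finditer(html)]
for p in last_paragraphs:
    print(repr(p))
```

This only handles <u> as nested markup, exactly like the pattern in the post. Other inline tags would need their own alternatives.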
dczraptor
Would the String.LastIndexOf method help you?
Unknown Name
DawnJ
Therefore, here is a better representation of the actual data:
<h2>article title 1</h2>
<p>paragraph 1 - bladiblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 2 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<h2>article title 2</h2>
<p>paragraph 1 - bladiblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 2 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 3 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 4 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<h2>article title 3</h2>
<p>paragraph 1 - bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 2 - bladiblah di blahblah.</p>
<p>paragraph 3 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
Could you tell me what the "(?=\s*(?:<h2>|\Z))" part does?
As far as I understand this pattern, it says: look ahead for zero or more whitespace characters... oops... I don't know what the "(?:<h2>)" term does...
Could you explain this, Sergei Z? Thanks!
Harinarayan
I tried this:
<p>.*?</p>(?=\s*(?:<h2>|\Z))
with SingleLine on, and it matched all 3 occurrences of <p> tags from your original text:
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
<p>paragraph 4 - bladiblah di blahblah</p>
<p>paragraph 5 - bladiblah di blahblah</p>
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<h2>article title</h2>
<p>paragraph 1 - bladiblah di blahblah</p>
<p>paragraph 2 - bladiblah di blahblah</p>
<p>paragraph 3 - bladiblah di blahblah</p>
So how are <u> tags interfering here? Can you send the text?
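One caveat worth checking with the lazy-dot pattern `<p>.*?</p>(?=\s*(?:<h2>|\Z))`: the sketch below, in Python with `re.DOTALL` standing in for SingleLine (both are my assumptions, as is the shortened sample input), suggests that it does produce one match per block, but each match starts at the first <p> of the block rather than the last.

```python
import re

# Lazy dot with DOTALL, the Python counterpart of .NET's Singleline option.
PATTERN = re.compile(r"<p>.*?</p>(?=\s*(?:<h2>|\Z))", re.DOTALL)

# Hypothetical two-block input.
html = (
    "<h2>t1</h2>\n<p>a</p>\n<p>b</p>\n"
    "<h2>t2</h2>\n<p>c</p>\n<p>d</p>\n"
)

matches = [m.group(0) for m in PATTERN.finditer(html)]
for m in matches:
    print(repr(m))
```

The engine starts at the first <p> it finds, and `.*?` keeps expanding across intermediate </p><p> pairs until the lookahead is satisfied. The count of matches looks right, but each one spans every paragraph in its block.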
NoEgo
It's a non-capturing group, which basically says: match the item, but don't place it in the captured list. For example, given the input
abcdef
and the pattern
(?:ab)(?<WhatIWant>\w{4})
the only group that comes back is the WhatIWant group:
cdef
If one turns on ExplicitCapture, then any unnamed groups are not captured, without having to write (?:xxx). With ExplicitCapture turned on, the above example could be changed to
(ab)(?<WhatIWant>\w{4})
This is very useful for weeding out items that are not needed as groups while still keeping them in the full match. Check out the documentation on Grouping Constructs.
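The same example as a runnable sketch in Python (an assumption on my part, since the thread targets .NET). Python spells named groups `(?P<name>...)` instead of `(?<name>...)`, and as far as I know its `re` module has no equivalent of ExplicitCapture, so only the `(?:...)` form is shown.

```python
import re

text = "abcdef"

# (?:ab) takes part in the match but is not captured;
# the named group captures the next four word characters.
m = re.search(r"(?:ab)(?P<WhatIWant>\w{4})", text)

print(m.group(0))            # the full match still includes "ab"
print(m.groups())            # only one captured group
print(m.group("WhatIWant"))  # the named capture
```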
cap1000
I just ran the pattern
<p>(?:<u>[^<>]*</u>|[^<>])*</p>(?=\s*(?:<h2>|\Z))
in Expresso (.NET regex engine) against your latest input (from which you should have started the thread, by the way). I got matches, bolded in the original post, according to your spec:
<h2>article title 1</h2>
<p>paragraph 1 - bladiblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 2 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<h2>article title 2</h2>
<p>paragraph 1 - bladiblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 2 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 3 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 4 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<h2>article title 3</h2>
<p>paragraph 1 - bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
<p>paragraph 2 - bladiblah di blahblah.</p>
<p>paragraph 3 - bladiblah di blahblah
di blahblah. bladiblah di blahblah
di blahblah. bladiblah di blahblah.</p>
(?: ) simply instructs the engine not to bother with capturing the text into a group: it's called a non-capturing group. It speeds up processing.
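To close the loop on the original request (marking the last <p> of each block), here is a hedged sketch in Python rather than .NET. The function name `mark_last_paragraphs` and the `class="last"` attribute are hypothetical; the thread never says what the "specific markup" should be.

```python
import re

# Same pattern as above: a <p> containing only text or <u>...</u> runs,
# followed by optional whitespace and then <h2> or end of input.
LAST_P = re.compile(r"<p>(?:<u>[^<>]*</u>|[^<>])*</p>(?=\s*(?:<h2>|\Z))")


def mark_last_paragraphs(html, css_class="last"):
    """Replace the opening <p> of each block's last paragraph with a classed one."""
    def add_class(match):
        # Keep everything after the opening "<p>" unchanged.
        return '<p class="%s">' % css_class + match.group(0)[len("<p>"):]
    return LAST_P.sub(add_class, html)


# Hypothetical multi-line sample in the shape of the data above.
sample = "<h2>t1</h2>\n<p>a\nb</p>\n<p>c</p>\n<h2>t2</h2>\n<p>d</p>\n"
marked = mark_last_paragraphs(sample)
print(marked)
```

In .NET the equivalent step would presumably be Regex.Replace with a MatchEvaluator or a replacement string, but that is not shown in the thread.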
Jeff Irish
JotaC
Ashish,
could you please stop spitting out one-liners and tell me exactly how you are going to find the LAST occurrence of <p> within a text block headed by <h2>? I'm still curious. Thanks.
Sergei Z
LastBoyScout
Ashish,
please explain in detail how String.LastIndexOf can help in this situation. I'm very curious. Thanks.
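For what it's worth, one way a last-index search could work is sketched below in Python, where `str.rfind` plays the role of .NET's String.LastIndexOf. This is an assumption about what the suggestion meant, not something spelled out in the thread; the function name `last_p_offsets` and the sample input are hypothetical. The idea is to bound each block by consecutive <h2> positions and search backwards for "<p>" inside those bounds.

```python
def last_p_offsets(html):
    """Return the offset of the last "<p>" in every <h2>-headed block."""
    offsets = []
    block_start = html.find("<h2>")
    while block_start != -1:
        # The block ends where the next <h2> begins, or at end of input.
        next_block = html.find("<h2>", block_start + 1)
        block_end = len(html) if next_block == -1 else next_block
        # rfind searches backwards within html[block_start:block_end].
        pos = html.rfind("<p>", block_start, block_end)
        if pos != -1:
            offsets.append(pos)
        block_start = next_block
    return offsets


# Hypothetical sample; nested <u> markup does not matter here.
sample = "<h2>t1</h2>\n<p>a</p>\n<p><u>b</u></p>\n<h2>t2</h2>\n<p>c</p>\n"
found = [sample[o:o + 6] for o in last_p_offsets(sample)]
print(found)
```

Unlike the regex approach, this does not care what markup sits inside the paragraph, but it assumes the opening tag is always the literal string "<p>" with no attributes.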