The following program is meant to strip the HTML tags from a string.
It runs for several minutes, far longer than any normal program, and then finally fails with the error shown after the code.
Why? Thanks.
<%@ Import Namespace="System"%>
<%@ Import Namespace="System.Text.RegularExpressions"%>
<script language="C#" runat="server">
void Page_Load(object sender, EventArgs e)
{
    string temp = "<NOSCRIPT><IMG height=0 alt= src=HTMLfiles/s9.gif";
    Response.Write(StripHTML(temp));
}

public static string StripHTML(string strHtml)
{
    string[] aryReg = {
        @"<script[^>]*?>.*?<" + "/script>",
        @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
        @"([\r\n])[\s]+",
        @"&(quot|#34);",
        @"&(amp|#38);",
        @"&(lt|#60);",
        @"&(gt|#62);",
        @"&(nbsp|#160);",
        @"&(iexcl|#161);",
        @"&(cent|#162);",
        @"&(pound|#163);",
        @"&(copy|#169);",
        @"&#(\d+);",
        @"-->",
        @"<!--.*\n"
    };
    string[] aryRep = {
        "",
        "",
        "",
        "\"",
        "&",
        "<",
        ">",
        " ",
        "\xa1", // chr(161)
        "\xa2", // chr(162)
        "\xa3", // chr(163)
        "\xa9", // chr(169)
        "",
        "\r\n",
        ""
    };
    string newReg = aryReg[0];
    string strOutput = strHtml;
    for (int i = 0; i < aryReg.Length; i++)
    {
        Regex regex = new Regex(aryReg[i], RegexOptions.IgnoreCase);
        strOutput = regex.Replace(strOutput, aryRep[i]);
    }
    strOutput.Replace("<", "");
    strOutput.Replace(">", "");
    strOutput.Replace("\r\n", "");
    return strOutput;
}
</script>

System.Web.HttpException: Request timed out.
Kalidas
Oh, Jocular, thanks. How should I solve this problem? I am new to C# and have little experience. I just can't figure out why this timed out.
Without a for loop, when I replace the string only ONCE, the code works fine, but inside this for loop it acts up.
I hope you can give me more hints. Thank you very much!
fripper
The problem is with the second RE, not the for loop itself. If you remove the second RE, the code will probably work as expected, although I didn't try the remaining expressions.
You need to fix the second RE. I consider myself pretty knowledgeable in REs, but I can't figure out what this RE is trying to do. It doesn't seem to be looking for anything that would be remotely useful to you. Could you clarify what this particular RE is supposed to be looking for? Then perhaps we can provide an alternative RE that will work. In the meantime, remove the second RE and run your app to verify that all the other REs work correctly.
Michael Taylor - 8/11/06
Shamdogg
After a second reading of Jocular's comment, I came to realize that what Taylor refers to as the "second RE" is actually aryReg[1], right?
aryReg is an array of 15 regular expressions. Because there are so many HTML tags, it is not possible to cover all of them with a single RE. So to strip all the HTML tags from the string, we have to perform the strip action many times; that is why I put all the REs in one array and wrote a for loop.
With only one RE and one replacement, there is no problem.
Here we just repeat the same action, each time with a different RE, so why does a problem arise?
Mike Hadlow
The second regular expression? But I have ONLY one in the code, namely "regex" in the line Regex regex = new Regex(aryReg[i], RegexOptions.IgnoreCase);
I have two arrays, aryReg and aryRep. aryReg stores 15 HTML tag patterns, while aryRep stores the 15 corresponding strings that replace them.
Since these are arrays, I used a for loop to make sure every kind of tag pattern is replaced properly.
The final purpose is to strip all the HTML markup from the string.
Thanks.
jhusain
The second regular expression you have specified appears to do look-aheads and whatnot. Honestly, I can't figure out what that expression is supposed to match, and I believe the parser is having the same problem. It is probably lost in continually trying to find a match for one of the many subexpressions. You should try to remove the look-ahead component and simplify the expression to check for what you want. I would recommend using Regulator (http://regulator.sourceforge.net/) to test the expressions before trying to use them in a program.
Michael Taylor - 8/10/06
phoenixoxo
Taylor, first, thank you very much for your kindness and help!
After reading your advice several times, I finally understand that what you referred to as the second RE is aryReg[1], which Jocular had already pointed out to me! Haha, I am rather dumb sometimes.
I have now deleted the second RE, and the code seems to run normally. Because I am not familiar enough with REs, and the code was copied from somewhere else as study material, I am not sure what exactly the second RE is meant to do. Hehe.
Thank you very much for your guidance. Your advice is a great help to me!
Johan Andersson
The second aryReg expression will cause the RE engine to hang. It is not a true deadlock: the nested, optional, repeated groups most likely force so much backtracking on the unterminated <IMG tag that the match effectively never finishes.
The other aryReg expressions are valid but will not match anything in your sample string. Therefore, if you remove the second aryReg expression the code will run, but you will find that the string is unchanged. Reusing the same Regex variable for multiple expressions is perfectly legal.
Honestly, for simple HTML text your approach would work, but a fundamental problem with REs is that they are context insensitive. Therefore I could easily pass you strings that are valid HTML but that are no longer valid once your method returns. For example, the following string will get mangled by your code:
<SCRIPT><!--</SCRIPT>-->\n</SCRIPT>
It simply isn't possible to do context-sensitive parsing of strings with REs. You could come up with convoluted expressions to catch some of the issues, but they would be really difficult to write or understand. Your better option is to use an HTML parser that understands the rules of HTML. This ensures that the parsing is accurate and would greatly simplify your code. There are parsers available online. I believe .NET itself has a parser as well, but it is internal to the framework and so unavailable.
You could continue to use REs, but I would recommend that you alter your approach. First find every occurrence of the comment start characters (<!--). For each one, strip out everything up to and including the comment end characters (-->). You have then eliminated any potentially invalid HTML code from the string. You could use the grouping options in REs to match the comment bodies, or you could simply parse the string manually.
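The manual comment-stripping step described above could look roughly like the following. This is only a sketch: the class and method names (CommentStripper, StripComments) are made up for illustration, and dropping everything after an unterminated comment is an assumption about the desired behavior, not something stated in this thread.

```csharp
using System;
using System.Text;

public static class CommentStripper
{
    // Removes every <!-- ... --> span by scanning with IndexOf, so the cost
    // stays linear in the input length and no backtracking is involved.
    public static string StripComments(string html)
    {
        const string open = "<!--";
        const string close = "-->";
        StringBuilder sb = new StringBuilder(html.Length);
        int pos = 0;
        while (pos < html.Length)
        {
            int start = html.IndexOf(open, pos, StringComparison.Ordinal);
            if (start < 0)
            {
                // No further comments: keep the remainder and stop.
                sb.Append(html, pos, html.Length - pos);
                break;
            }
            // Keep the text that precedes the comment.
            sb.Append(html, pos, start - pos);
            int end = html.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0)
            {
                // Assumption: an unterminated comment swallows the rest of the input.
                break;
            }
            pos = end + close.Length;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripComments("before<!-- hidden -->after"));
    }
}
```

Because the comment bodies are skipped rather than matched, their content (including stray tags inside them) never reaches the later replacement steps.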
Once you've eliminated comments, do the simple replacement of HTML character entities. Technically you can use HttpUtility.HtmlDecode to convert standard HTML-encoded characters back to their actual form.
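For the entity replacement step, the framework already ships a decoder, so the long list of &quot;, &amp;, &lt; and similar patterns may not be needed at all. The sketch below assumes .NET Framework 4.0 or later, where System.Net.WebUtility.HtmlDecode is available outside of System.Web; in an ASP.NET page, HttpUtility.HtmlDecode from System.Web plays the same role. The wrapper class EntityDecodeDemo is a made-up name.

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Thin wrapper so the call site reads like the StripHTML helper in the question.
    public static string Decode(string encoded)
    {
        // Converts named and numeric character references back to characters.
        return WebUtility.HtmlDecode(encoded);
    }

    public static void Main()
    {
        Console.WriteLine(Decode("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt; &#65;"));
    }
}
```

Note that decoding turns &lt; and &gt; back into angle brackets, so it is best done after the tags have been removed, not before.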
Your final step is to strip out any HTML elements you don't want. You can use REs here to match the beginning and end elements and remove everything in the middle or you can simply do a search for the start token and skip everything until you reach the end token.
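The "search for the start token and skip everything until you reach the end token" variant can be sketched as a single pass over the characters. This is a simplified illustration, not a replacement for a real HTML parser: it treats every '<' as the start of a tag and the next '>' as its end, so a '>' inside a quoted attribute value would end the tag early, and the text content of elements such as script is kept. TagStripper and StripTags are hypothetical names.

```csharp
using System;
using System.Text;

public static class TagStripper
{
    // Copies every character that is outside a <...> span.
    // Runs in linear time, so an unterminated tag cannot make it hang.
    public static string StripTags(string html)
    {
        StringBuilder sb = new StringBuilder(html.Length);
        bool inTag = false;
        foreach (char c in html)
        {
            if (c == '<')
            {
                inTag = true;          // start token: begin skipping
            }
            else if (c == '>' && inTag)
            {
                inTag = false;         // end token: resume copying
            }
            else if (!inTag)
            {
                sb.Append(c);          // ordinary text outside any tag
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"));
        // The sample string from the question ends inside an unterminated tag.
        Console.WriteLine("[" + StripTags("<NOSCRIPT><IMG height=0 alt= src=HTMLfiles/s9.gif") + "]");
    }
}
```

With this approach the sample string from the question produces an empty result, because everything after the unterminated <IMG is treated as part of that tag.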
IMHO,
Michael Taylor - 8/12/06
Pratyush
aryReg is an array of regular expressions. The second regular expression Michael Taylor was referring to is aryReg[1]: @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>"
If you step through your code with a debugger you'll see that this is the expression that is timing out.
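If stepping through with a debugger is inconvenient, another way to find the slow expression is to give each Regex a match timeout and see which one exceeds it. The sketch below assumes .NET Framework 4.5 or later, where the Regex constructor accepts a TimeSpan and throws RegexMatchTimeoutException; that overload did not exist when this thread was written. RegexTimeoutProbe and FindSlowPattern are made-up names, and the two patterns in Main are just two of the harmless entries from aryReg, to be replaced with the full array.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTimeoutProbe
{
    // Runs each pattern against the input with a per-match time budget.
    // Returns the index of the first pattern that exceeds it, or -1 if none does.
    public static int FindSlowPattern(string input, string[] patterns, TimeSpan budget)
    {
        for (int i = 0; i < patterns.Length; i++)
        {
            // The third argument is the match timeout (.NET Framework 4.5+).
            Regex regex = new Regex(patterns[i], RegexOptions.IgnoreCase, budget);
            try
            {
                regex.Replace(input, "");
            }
            catch (RegexMatchTimeoutException)
            {
                return i;
            }
        }
        return -1;
    }

    public static void Main()
    {
        string input = "<NOSCRIPT><IMG height=0 alt= src=HTMLfiles/s9.gif";
        // Placeholder list: substitute the full aryReg array here.
        string[] patterns = { @"&(amp|#38);", @"([\r\n])[\s]+" };
        int slow = FindSlowPattern(input, patterns, TimeSpan.FromSeconds(2));
        Console.WriteLine(slow < 0
            ? "No pattern exceeded the budget."
            : "Slow pattern index: " + slow);
    }
}
```

A timeout of this kind also protects a production page: the request fails fast with an exception for one expression instead of running until ASP.NET reports "Request timed out."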
hanguyen