Underscore Problem?

Ok, I'm trying to write what should be a simple regex to parse some information out of emails. The problem is, well... strange. It'll be easiest to show by example.

Here's the overall regexp:

name:\s(?<contact>[\w\s]*)\nemail:\s(?<contactEmail>.*)[.\n]*community_name:\s(?<location>[\w\s]*)\n

Here is the text I am trying to match:

name: Bob Bobson
email: name@address.com
community_i_live_in: apartment
community_name: Vanilla Hills
address: 101 Sprinkles St
apartment: 302
username1: bbobson
password11: texas1
password12: texas1
username2:
password21:
password22:
phone: 555-555-5555
corporate_apartment: no
corporate_id:

Now, the regexp doesn't match this. However if I take part of the regexp and run the match it works. For instance if I use this:

name:\s(?<contact>[\w\s]*)\nemail:\s(?<contactEmail>.*)[.\n]*community_

It will match everything properly up to that point. To make things strange, if I go one character further the match fails:

name:\s(?<contact>[\w\s]*)\nemail:\s(?<contactEmail>.*)[.\n]*community_n

I can match the end of the string though, including the apparent failure point:

[.\n]*community_name:\s(?<location>[\w\s]*)\n

It will match properly too.

So basically, WTF?





  • Tomsi

    Telos wrote:
    I still don't understand what the original problem was though. Is there any actual reason that the original regex I wrote shouldn't work?


    It's actually working... just not as you expect. The issue is the \r\n line endings combined with the use of \s, \n and possibly .* in the pattern. \s can capture both a carriage return and a line feed, since both are whitespace, and there are two such characters, \r\n, at the end of each line. The regex engine is dutifully capturing those as per the specification, which leads to the non-matches.

    The pattern provided mixes them, as in [\s]*)\n and also (.*), which can match \n depending on the settings, as mentioned in my post. Because of that mixing, the regex engine captures characters in the match that you are not expecting, and that causes the whole pattern to fail. Hence the pattern retrieves data in some instances and not in others.

    I recommend that the pattern be rewritten to anticipate the appropriate whitespace instead of being affected by it. For handling the whitespace at the ends of the lines, look at the answer provided by Sergei Z in the post "Specific Paragraph Matching with Inner Matches".
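
    As a rough sketch (not taken from the linked post) of one way to write the pattern so that line endings are anticipated: \r?\n accepts either line ending, [^\r\n]* keeps each capture on a single line, and [\s\S]*? skips over the lines in between. It is also worth knowing that inside a character class the dot is a literal dot, so [.\n]* can only skip dots and newlines, never a whole line of text. The class name EmailFields and the shortened sample text are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class EmailFields
{
    // \r?\n      tolerates both \r\n and \n line endings.
    // [^\r\n]*   keeps each capture on one line, so no stray \r lands in a group.
    // [\s\S]*?   lazily skips any characters, newlines included.
    const string Pattern =
        @"name:[ \t]*(?<contact>[^\r\n]*)\r?\n" +
        @"email:[ \t]*(?<contactEmail>[^\r\n]*)\r?\n" +
        @"[\s\S]*?" +
        @"community_name:[ \t]*(?<location>[^\r\n]*)\r?\n";

    public static Match Parse(string body)
    {
        return Regex.Match(body, Pattern);
    }

    static void Main()
    {
        string body =
            "name: Bob Bobson\r\n" +
            "email: name@address.com\r\n" +
            "community_i_live_in: apartment\r\n" +
            "community_name: Vanilla Hills\r\n" +
            "address: 101 Sprinkles St\r\n";

        Match m = Parse(body);
        Console.WriteLine(m.Groups["contact"].Value);      // Bob Bobson
        Console.WriteLine(m.Groups["contactEmail"].Value); // name@address.com
        Console.WriteLine(m.Groups["location"].Value);     // Vanilla Hills
    }
}
```

    The lazy [\s\S]*? is what lets the match get past community_i_live_in and stop at the first community_name line.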




  • LukeMancev

    Are you using the regex option Singleline? When it is set, (.) is allowed to match \n, which could simplify your regex...
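
    To illustrate what that option changes, here is a small sketch. The class name SinglelineDemo, the helper Spans and the two-line sample text are made up for the example; RegexOptions.Singleline is the actual .NET option.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // True if the pattern can reach from "email:" to "community_name:".
    public static bool Spans(string text, RegexOptions options)
    {
        return Regex.IsMatch(text, @"email:.*community_name:", options);
    }

    static void Main()
    {
        string text = "email: name@address.com\ncommunity_name: Vanilla Hills\n";

        // Default: '.' stops at \n, so .* cannot cross into the next line.
        Console.WriteLine(Spans(text, RegexOptions.None));       // False

        // Singleline: '.' also matches \n, so .* can cross lines.
        Console.WriteLine(Spans(text, RegexOptions.Singleline)); // True
    }
}
```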


  • JonEbersole

    Ok, I had tried single line but that made it harder to get things matched up right, since I only knew the end of the line by the newline character. Single line only would have meant writing in every field even if I didn't care about it, and that's just too much work. ;)

    Thanks for your help!



  • Keith Hill

    Telos wrote:
    Single line only would have meant writing in every field even if I didn't care about it, and that's just too much work. ;)


    Regex is about patterns, but the real power is in developing a generic pattern. You have a pretty easy pattern: a field name followed by a colon, then data until the end of the line. If one sets the Multiline option on and runs this pattern

    ^(?<Field>[^:]*)(?:\:)(?<Data>[^\r\n$]*)

    Each match will be an individual line and contain only two groups, Field and Data. Place each of the matches in a Dictionary<string, string> where Field is the key and data is the value. Then all you have to do is access it by string email = myDict["email"];

    Plus, if you ever have to access any other field, voilà, it's already there... or if the email changes to add or remove fields, the regex can handle it.
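
    A minimal sketch of that idea follows. It uses a slight variation of the pattern above (newlines are excluded from Field, and spaces after the colon are skipped); that variation, the class name FieldParser and the sample text are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FieldParser
{
    // Multiline makes ^ match at the start of every line.
    // Field: everything before the first colon on the line.
    // Data:  the rest of the line, without the line ending.
    static readonly Regex LineRegex = new Regex(
        @"^(?<Field>[^:\r\n]+):[ \t]*(?<Data>[^\r\n]*)",
        RegexOptions.Multiline);

    public static Dictionary<string, string> Parse(string body)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in LineRegex.Matches(body))
        {
            // The indexer overwrites a repeated field name instead of throwing.
            fields[m.Groups["Field"].Value] = m.Groups["Data"].Value;
        }
        return fields;
    }

    static void Main()
    {
        string body = "name: Bob Bobson\r\nemail: name@address.com\r\nusername2:\r\n";
        Dictionary<string, string> myDict = Parse(body);
        Console.WriteLine(myDict["email"]);            // name@address.com
        Console.WriteLine(myDict["username2"].Length); // 0
    }
}
```

    Fields with no value, such as username2, end up in the dictionary with an empty string, so a lookup on them does not throw.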


  • KJBalaji

    Daniel Kuppitz wrote:

    Hi Telos,

    I just had too much time..

    Clearly!

    Thanks though, it seems to be working, with the minor addition of a function to return IDictionary instead of IEnumerable. I still don't understand what the original problem was though. Is there any actual reason that the original regex I wrote shouldn't work?



  • chandrika3

    Hi Telos,

    I just had too much time..
    Try this:

    using System;
    using System.Collections.Generic;
    using System.Collections.ObjectModel;
    using System.Collections.Specialized;
    using System.Text.RegularExpressions;
    namespace ConsoleApplication1
    {
        // Describes one "fieldname: value" line and the group that captures its value.
        class EmailFieldRegex
        {
            private string _fieldName;
            private string _groupName;
            private string _pattern;
            // Default value pattern: everything up to the end of the line.
            public EmailFieldRegex(string fieldName, string groupName)
                : this(fieldName, groupName, "[^\n]*")
            { }
            public EmailFieldRegex(string fieldName, string groupName, string pattern)
            {
                _fieldName = Regex.Escape(fieldName);
                _groupName = groupName;
                _pattern = pattern;
            }
            public string Pattern
            {
                // \r?\n accepts both \r\n and \n line endings.
                get { return String.Format(@"({0}:\s(?<{1}>{2})\r?\n)",
                    _fieldName, _groupName, _pattern); }
            }
            internal string GroupName
            {
                get { return _groupName; }
            }
        }
        class EmailFieldRegexCollection : Collection<EmailFieldRegex>
        {
            private string[] _groupNames;
            // Joins the field patterns into one alternation: (a)|(b)|(c)
            private Regex BuildExpression()
            {
                string[] expressions;
                int count = base.Items.Count;
                expressions = new string[count];
                _groupNames = new string[count];
                for (int index = 0; index < count; index++)
                {
                    expressions[index] = base.Items[index].Pattern;
                    _groupNames[index] = base.Items[index].GroupName;
                }
                return new Regex(String.Join("|", expressions));
            }
            // Yields one (group name, captured value) pair per matched field.
            public IEnumerable<KeyValuePair<string, string>> Matches(string str)
            {
                Regex regex = BuildExpression();
                Match m = regex.Match(str);
                while (null != m && m.Success)
                {
                    foreach (string groupName in _groupNames)
                    {
                        if (m.Groups[groupName].Success)
                        {
                            yield return new KeyValuePair<string, string>(
                                groupName, m.Groups[groupName].Value);
                        }
                    }
                    m = m.NextMatch();
                }
            }
        }
        class Program
        {
            static void Main(string[] args)
            {
                EmailFieldRegexCollection exp = new EmailFieldRegexCollection();
                exp.Add(new EmailFieldRegex("name", "contact", @"[\w\s]*"));
                exp.Add(new EmailFieldRegex("email", "contactEmail"));
                exp.Add(new EmailFieldRegex("community_name", "location", @"[\w\s]*"));
                string test = @"
    name: Bob Bobson
    email: name@address.com
    community_i_live_in: apartment
    community_name: Vanilla Hills
    address: 101 Sprinkles St
    apartment: 302
    username1: bbobson
    password11: texas1
    password12: texas1
    username2:
    password21:
    password22:
    phone: 555-555-5555
    corporate_apartment: no
    corporate_id: ";
                foreach (KeyValuePair<string, string> kv in exp.Matches(test))
                {
                    Console.WriteLine("{0}: {1}", kv.Key, kv.Value);
                }
            }
        }
    }
    --
    Regards,
    Daniel Kuppitz
