I am a total noob to regex. I have a bunch of user agent strings that I want to parse.

Windows Phone Search (Windows Phone OS 7.10;Acer;Allegro;7.10;8860)
Windows Phone Search (Windows Phone OS 7.10;HTC;7 Mozart T8698;7.10;7713)
Windows Phone Search (Windows Phone OS 7.10;HTC;Radar C110e;7.10;7720)

How can I use regex to just extract:

A) Windows Phone OS 7.10 Acer Allegro

B) Windows Phone OS 7.10 HTC 7 Mozart

C) Windows Phone OS 7.10 HTC Radar

I have tried to use Split in the following way but to no avail:

private static string parse(string input) 
{ 
    input = input.Remove(0, input.IndexOf('(') + 1).Replace(')', ' ').Trim(); 
    string[] temp = input.Split(';'); 
    if (temp[2].Contains('T'))
    { 
        temp[2] = temp[2].Substring(0, temp[2].IndexOf('T')).Trim(); 
    } 
    StringBuilder sb = new StringBuilder(); 
    sb.Append(temp[0] + " "); 
    sb.Append(temp[1] + " "); 
    sb.Append(temp[2]); 
    return sb.ToString(); 
}

Comments

  • Have you tried anything yet? Any pattern or pseudocode to make up the regex? Commented Jul 11, 2013 at 21:44
  • I would use IndexOf or Split for this. Commented Jul 11, 2013 at 21:45
  • Anything wrong with a simple String.Split? Or do you want to learn regular expressions (then your question should be worded differently)... Commented Jul 11, 2013 at 21:45
  • Yup, String.Split over ';', a trim on the first match, and dropping the last 2 would get you what you want. (Well, almost. You'd want to further split on whitespace in the event you get "Mozart T8698".) Commented Jul 11, 2013 at 21:47
  • While I'd probably go with regex instead of Split here for convenience, the real question is what you want to happen to strings that do not exactly match these patterns. For example, note that "Allegro" is a token on its own while "Mozart" and "Radar" both have secondary tokens that you don't want to keep. What if you have a UA string with three tokens in that position? Or four? Or none? Commented Jul 11, 2013 at 21:52
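
The Split-based approach suggested in the comments could look roughly like the sketch below. The class and method names (`UserAgentSplit`, `Parse`) are made up for illustration, and the rule for dropping the hardware code (remove the last model token if it contains a digit and other tokens remain) is an assumption based only on the three sample strings, not a general rule for user agents.

```csharp
using System;
using System.Linq;

public static class UserAgentSplit
{
    // Hypothetical helper: returns "OS Manufacturer Model" for a Windows Phone
    // search user agent, or null when the expected fields are missing.
    public static string Parse(string input)
    {
        int open = input.IndexOf('(');
        int close = input.LastIndexOf(')');
        if (open < 0 || close <= open) return null;

        // Fields inside the parentheses: OS;manufacturer;model;version;build
        string[] fields = input.Substring(open + 1, close - open - 1)
                               .Split(';')
                               .Select(f => f.Trim())
                               .ToArray();
        if (fields.Length < 3) return null;

        // Assumption: a trailing model token containing a digit (T8698, C110e)
        // is a hardware code; it is dropped only when other tokens remain.
        string[] model = fields[2].Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (model.Length > 1 && model[model.Length - 1].Any(char.IsDigit))
        {
            model = model.Take(model.Length - 1).ToArray();
        }

        return string.Join(" ", new[] { fields[0], fields[1], string.Join(" ", model) });
    }
}
```

Calling `UserAgentSplit.Parse` on each line of the input yields one string per user agent. Strings with a different field layout return null rather than a partial result, which is one possible answer to the "what if it does not match" question raised in the last comment.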

2 Answers

I use regular expressions because they were designed specifically for parsing text. Once one understands the basics of regex patterns, they become useful in almost any text-processing situation.

In this pattern my goal is to separate each item into the named capture groups Version, Phone, Type, Major and Minor. Once the regex has done that, I can use LINQ to extract the data as shown.

string @pattern = @"
(?:OS\s)                     # Match but don't capture (MDC) OS, used as an anchor
(?<Version>\d\.\d+)          # Version of OS
(?:;)                        # MDC ;
(?<Phone>[^;]+)              # Get phone name up to ;
(?:;)                        # MDC ;
(?<Type>[^;]+)               # Get phone type up to ;
(?:;)                        # MDC ;
(?<Major>\d\.\d+)            # Major version
(?:;)                        # MDC ;
(?<Minor>\d+)                # Minor version
";

string data =
@"Windows Phone Search (Windows Phone OS 7.10;Acer;Allegro;7.10;8860)
Windows Phone Search (Windows Phone OS 7.10;HTC;7 Mozart T8698;7.10;7713)
Windows Phone Search (Windows Phone OS 7.10;HTC;Radar C110e;7.10;7720)";

// RegexOptions.IgnorePatternWhitespace lets us comment the pattern; it only affects how the pattern is read, not how the input is matched
var phones = Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace)
                  .OfType<Match>()
                  .Select (mt => new
                  {
                    Name = mt.Groups["Phone"].Value,
                    Type = mt.Groups["Type"].Value,
                    Version = string.Format( "{0}.{1}", mt.Groups["Major"].Value,
                                                        mt.Groups["Minor"].Value)
                  }
                  );

Console.WriteLine ("Phones Supported are:");

phones.Select(ph => string.Format("{0} of type {1} version ({2})", ph.Name, ph.Type, ph.Version))
      .ToList()
      .ForEach(Console.WriteLine);

/* Output
Phones Supported are:
Acer of type Allegro version (7.10.8860)
HTC of type 7 Mozart T8698 version (7.10.7713)
HTC of type Radar C110e version (7.10.7720)
*/
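
The output above lists the phones, not the exact strings the question asks for. As a rough sketch, named groups could be recombined into that shape as follows. The `Os` group, the `UserAgentGroups` class, the `Describe` method and the `TrailingCode` rule are all illustrative additions and not part of the pattern above. Treating a trailing token that contains a digit as a hardware code is an assumption drawn from the three samples.

```csharp
using System.Text.RegularExpressions;

public static class UserAgentGroups
{
    // Hypothetical variant of the pattern above: an extra "Os" group keeps the
    // OS name and version together so the pieces can be joined directly.
    private static readonly Regex Fields = new Regex(@"
        \( (?<Os>[^;]+)       # OS name and version, up to the first ;
        ;  (?<Phone>[^;]+)    # manufacturer
        ;  (?<Type>[^;]+)     # model, possibly followed by a hardware code
        ;                     # the version and build fields follow, unused here
        ", RegexOptions.IgnorePatternWhitespace);

    // Assumption: a trailing token containing a digit (T8698, C110e) is a
    // hardware code and not part of the model name.
    private static readonly Regex TrailingCode = new Regex(@"\s+\S*\d\S*$");

    public static string Describe(string userAgent)
    {
        Match m = Fields.Match(userAgent);
        if (!m.Success) return null;

        string type = TrailingCode.Replace(m.Groups["Type"].Value.Trim(), "");
        return string.Join(" ", m.Groups["Os"].Value.Trim(),
                                m.Groups["Phone"].Value.Trim(),
                                type);
    }
}
```

`UserAgentGroups.Describe` takes a single user agent string. A model name that legitimately ends in a token with a digit would be cut short by `TrailingCode`, so the rule would need adjusting for real-world data.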

This regex will capture the first three semicolon-separated fields inside the parentheses:

(?<=\().*?;.*?;.*?(?=;)

As code it would be:

string s = Regex.Match(input, @"(?<=\().*?;.*?;.*?(?=;)").Value;

A breakdown of the regex:

  • (?<=\() = a "look behind" that asserts the previous char is a literal open parenthesis (
  • .*?; = a non-greedy match of everything up to and including the next ; (used twice; the third .*? stops just before the look-ahead)
  • (?=;) = a "look ahead" that asserts the next char is a literal semi-colon ;
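
To show what the look-around pattern actually returns, here is a minimal sketch. `UserAgentLookaround` and `FirstThreeFields` are hypothetical names. The match itself still contains the ; separators, so the sketch swaps them for spaces. It does not remove hardware codes such as T8698; that would need an extra step.

```csharp
using System.Text.RegularExpressions;

public static class UserAgentLookaround
{
    // Hypothetical wrapper around the pattern from this answer.
    public static string FirstThreeFields(string input)
    {
        Match m = Regex.Match(input, @"(?<=\().*?;.*?;.*?(?=;)");

        // The raw match looks like "Windows Phone OS 7.10;Acer;Allegro",
        // so the separators are replaced to get a space-delimited string.
        return m.Success ? m.Value.Replace(';', ' ') : null;
    }
}
```

For the Acer sample this gives the string the question asks for. For the two HTC samples the model code is still attached.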

2 Comments

Could you change this regex to a multi line regex that explains each section of the regex? New users may find that more helpful than a one liner that they probably don't understand.
@GeorgeStocker How's that?
