
We have a requirement to extract and manipulate strings in C# .NET. We have a string like this:

($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:("[email protected]"))

We need to extract the strings between the $ characters.

Therefore, in the end, we need to get a list of strings containing - name, phonenumber, emailaddress.

What would be the ideal way to do it? Are there any out-of-the-box features available for this?

Regards,

John

  • That's not extracting, that's parsing. It's simple enough though that it can be performed with a regular expression, eg @"\$\w+\$" Commented Jul 13, 2017 at 16:00
  • Split the string on $ and take every odd numbered occurrence in the resulting enumerable (i.e. 1st, 3rd, 5th etc) :) Commented Jul 13, 2017 at 16:01
  • @DavidG that's slower and more complex than a regex. It generates a lot of temporary strings too Commented Jul 13, 2017 at 16:02
  • I will go with regex as well, but what have you tried so far @Silly John? Commented Jul 13, 2017 at 16:02
  • @PanagiotisKanavos I never claimed it was fast, and many people would say that regex is more complex (if you don't have any understanding of it) Commented Jul 13, 2017 at 16:02

2 Answers


The simplest way is to use a regular expression that matches one or more word characters between two $ characters:

var regex=new Regex(@"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"[email protected]\"))";

var matches=regex.Matches(input);

This returns a collection of matches. The .Value property of each match contains the matched text. \$ is used because $ has a special meaning in regular expressions: it matches the end of the string. \w matches a word character (a letter, digit or underscore), and + means one or more.

Since this is a collection, you can use LINQ on it to get eg an array with the values:

var values=matches.OfType<Match>().Select(m=>m.Value).ToArray();

That array will contain the values $name$,$phonenumer$,$emailaddress$.
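As a quick check of what \w accepts in the pattern above, here is a minimal sketch. The class name WordCharDemo and both sample inputs are made up for illustration; the hyphenated field name in particular is hypothetical and does not come from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WordCharDemo
{
    // Returns every "$...$" token that the pattern finds in the input.
    public static string[] Tokens(string input)
    {
        return Regex.Matches(input, @"\$\w+\$")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    static void Main()
    {
        // Letters, digits and underscores are word characters, so this token matches.
        Console.WriteLine(string.Join(",", Tokens("$first_name1$:('George')")));

        // Hypothetical input: '-' is not a word character, so no token is found.
        Console.WriteLine(Tokens("$first-name$:('George')").Length);
    }
}
```

If field names can contain other characters, a negated character class such as [^$]+ is one possible replacement for \w+.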

Capture by name

You can specify groups in the pattern and attach names to them. For example, you can group the field name values:

var regex=new Regex(@"\$(?<name>\w+)\$");
var names=regex.Matches(input)
                .OfType<Match>()
                .Select(m=>m.Groups["name"].Value);

This will return name,phonenumer,emailaddress. Parentheses are used for grouping, and (?<somename>pattern) attaches a name to the group.

Extract both names and values

You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, eg as an object or anonymous type.

The pattern in this case is more complex:

@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"

The parentheses around the value are escaped as \( and \) because we want to match literal parentheses rather than start a group. Both ' and " are used to quote values, so the character class ['"] matches either one. The pattern is a verbatim string (ie it starts with @), so each double quote has to be doubled: ['""] . The value itself can contain any character, which .+ would match, but we only want to match up to the closing quote, so the lazy form .+? is used. Without the ? the greedy .+ would match as much as possible, running all the way to the last quote followed by ) in the input.
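The difference between .+? and .+ can be seen in a small sketch. It uses a shortened, two-field version of the question's input, and the class and method names (LazyVsGreedy, Values) are made up for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LazyVsGreedy
{
    // Shortened version of the question's input (two fields only).
    public const string Input = "($name$:('George') AND $phonenumer$:('456456'))";

    // Returns the captured "value" group of every match of the given pattern.
    public static string[] Values(string pattern)
    {
        return Regex.Matches(Input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["value"].Value)
                    .ToArray();
    }

    static void Main()
    {
        var lazy   = Values(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
        var greedy = Values(@"\$(?<name>\w+)\$:\(['""](?<value>.+)['""]\)");

        // Lazy: one value per field.
        Console.WriteLine(string.Join(" | ", lazy));

        // Greedy: a single value that swallows everything up to the last quote.
        Console.WriteLine(string.Join(" | ", greedy));
    }
}
```

With the lazy pattern each field produces its own match, while the greedy pattern produces one match whose value runs from George to 456456.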

Putting this together:

var regex = new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
          .OfType<Match>()
          .Select(m=>new {  Name=m.Groups["name"].Value, 
                            Value=m.Groups["value"].Value
            })
          .ToArray();

Turn them into a dictionary

Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), eg with .ToDictionary(it=>it.Name,it=>it.Value). You could omit the select step and generate the dictionary from the matches themselves :

var myDict = regex.Matches(input)
          .OfType<Match>()
          .ToDictionary(m=>m.Groups["name"].Value, 
                        m=>m.Groups["value"].Value);

Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that parses the input and skips non-matching input immediately. Each match and group contains only the indexes of its starting and ending characters in the input string. A string is only generated when .Value is called.
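The position information mentioned above is exposed through the Index and Length properties of each match and group. A minimal sketch, with a made-up class name (MatchPositions) and a shortened input:

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchPositions
{
    // Returns { index, length } of the "name" group of the first match,
    // or null when nothing matches.
    public static int[] FirstNameSpan(string input)
    {
        Match m = Regex.Match(input, @"\$(?<name>\w+)\$");
        if (!m.Success) return null;

        Group g = m.Groups["name"];
        return new[] { g.Index, g.Length };
    }

    static void Main()
    {
        string input = "($name$:('George'))";
        int[] span = FirstNameSpan(input);

        // Only positions have been read so far; the text is cut out on demand.
        Console.WriteLine(span[0] + " " + span[1]);
        Console.WriteLine(input.Substring(span[0], span[1]));
    }
}
```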

Regex objects are immutable and thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request.
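A shared instance could look like the following sketch. The class name FieldParser and the use of RegexOptions.Compiled are my own choices for illustration, not something the question requires; the Compiled option trades a slower start-up for faster matching and can be dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class FieldParser
{
    // Created once and shared by every caller, on any thread.
    private static readonly Regex FieldRegex =
        new Regex(@"\$(?<name>\w+)\$", RegexOptions.Compiled);

    // Returns the field names found between $ characters.
    public static List<string> GetNames(string input)
    {
        return FieldRegex.Matches(input)
                         .Cast<Match>()
                         .Select(m => m.Groups["name"].Value)
                         .ToList();
    }

    static void Main()
    {
        var names = GetNames("($name$:('George') AND $phonenumer$:('456456'))");
        Console.WriteLine(string.Join(",", names));
    }
}
```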

Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.

Can it go faster?

Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text when a $ is detected and stop at the next $. This can be done with the following method:

IEnumerable<string> GetNames(string input)
{
    var builder=new StringBuilder(20);
    bool started=false;
    foreach(var c in input)
    {        
        if (started)
        {
            if (c!='$')
            {
                builder.Append(c);
            }
            else
            {
                started=false;
                var value=builder.ToString();
                yield return value;
                builder.Clear();
            }
        }
        else if (c=='$')
        {
            started=true;
        }        
    }
}

A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's larger than 20 characters.

Modifying this code to extract values though isn't so easy.
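For what it's worth, here is one rough sketch of how a hand-written scanner could return values as well. It is not equivalent to the regex: it assumes every field looks exactly like $name$:('value') or $name$:("value"), that the value does not contain its own quote character, and it does not check that names consist of word characters. The names PairScanner and GetPairs are made up.

```csharp
using System;
using System.Collections.Generic;

static class PairScanner
{
    public static IEnumerable<KeyValuePair<string, string>> GetPairs(string input)
    {
        int pos = 0;
        while (true)
        {
            // Find the two '$' characters that delimit the name.
            int nameStart = input.IndexOf('$', pos);
            if (nameStart < 0) yield break;
            int nameEnd = input.IndexOf('$', nameStart + 1);
            if (nameEnd < 0) yield break;

            // Expect ":(" followed by an opening quote right after the name.
            int quotePos = nameEnd + 3;
            if (quotePos >= input.Length
                || input[nameEnd + 1] != ':'
                || input[nameEnd + 2] != '('
                || (input[quotePos] != '\'' && input[quotePos] != '"'))
            {
                // Not a field; retry with the second '$' as a possible opener.
                pos = nameEnd;
                continue;
            }

            // The value ends at the next occurrence of the same quote character.
            int valueStart = quotePos + 1;
            int valueEnd = input.IndexOf(input[quotePos], valueStart);
            if (valueEnd < 0) yield break;

            yield return new KeyValuePair<string, string>(
                input.Substring(nameStart + 1, nameEnd - nameStart - 1),
                input.Substring(valueStart, valueEnd - valueStart));

            pos = valueEnd + 1;
        }
    }

    static void Main()
    {
        foreach (var pair in GetPairs("($name$:('George') AND $phonenumer$:('456456'))"))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

Each assumption listed above is something the regex handles in its pattern, which is why the regex version is usually the easier one to maintain.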


Here's one way to do it, but certainly not very elegant. Basically splitting the string on the '$' and taking every other item will give you the result (after some additional trimming of unwanted characters).

In this example, I'm also grabbing the value of each item and then putting both in a dictionary:

var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"[email protected]\"))";
var inputParts = input.Replace(" AND ", "")
    .Trim(')', '(')
    .Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);

var keyValuePairs = new Dictionary<string, string>();

for (int i = 0; i < inputParts.Length - 1; i += 2)
{
    var key = inputParts[i];
    var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');

    keyValuePairs[key] = value;
}

foreach (var kvp in keyValuePairs)
{
    Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}

// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();

