Extracting and Manipulating Strings in C#.Net

Question

We have a requirement to extract and manipulate strings in C#.NET. We have the following string:

($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:("[email protected]"))

We need to extract the strings between the $ characters.

In the end, we need a list of strings containing name, phonenumer and emailaddress.

What would be the ideal way to do this? Are there any out-of-the-box features available for it?

Regards,

John

That's not extracting, that's parsing. It's simple enough, though, that it can be performed with a regular expression, eg @"\$\w+\$"
– Panagiotis Kanavos, Commented Jul 13, 2017 at 16:00
Split the string on $ and take every odd numbered occurrence in the resulting enumerable (i.e. 1st, 3rd, 5th etc) :)
– DavidG, Commented Jul 13, 2017 at 16:01
@DavidG that's slower and more complex than a regex. It generates a lot of temporary strings too
– Panagiotis Kanavos, Commented Jul 13, 2017 at 16:02
I would go with a regex as well, but what have you tried so far @Silly John?
– Amr Elgarhy, Commented Jul 13, 2017 at 16:02
@PanagiotisKanavos I never claimed it was fast, and many people would say that regex is more complex (if you don't have any understanding of it)
– DavidG, Commented Jul 13, 2017 at 16:02

Panagiotis Kanavos · Accepted Answer · 2017-07-13 17:56:45Z

The simplest way is to use a regular expression to match runs of word characters between $ characters:

var regex=new Regex(@"\$\w+\$");
var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"[email protected]\"))";

var matches=regex.Matches(input);

This will return a collection of matches. The .Value property of each match contains the matching string. \$ is used because $ has special meaning in regular expressions: it matches the end of a string. \w means a word character (a letter, digit, or underscore). + means one or more.

Since this is a collection, you can use LINQ on it to get eg an array with the values:

var values=matches.OfType<Match>().Select(m=>m.Value).ToArray();

That array will contain the values $name$ , $phonenumer$ , $emailaddress$ .
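
As a quick check of what \w accepts, here is a minimal sketch. The two sample tokens are made up for illustration and are not part of the question's input:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical inputs, chosen only to show which characters \w accepts.
var regex = new Regex(@"\$\w+\$");

// Letters, digits and underscores are word characters, so this token matches.
bool underscoreMatches = regex.IsMatch("$first_name1$");

// A hyphen is not a word character, so this token does not match.
bool hyphenMatches = regex.IsMatch("$first-name$");

Console.WriteLine(underscoreMatches); // True
Console.WriteLine(hyphenMatches);     // False
```

If field names can contain other characters, a negated class such as [^$]+ in place of \w+ would be one option.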

Capture by name

You can specify groups in the pattern and attach names to them. For example, you can group the field name values:

var regex=new Regex(@"\$(?<name>\w+)\$");
var names=regex.Matches(input)
                .OfType<Match>()
                .Select(m=>m.Groups["name"].Value);

This will return name, phonenumer, emailaddress. Parentheses are used for grouping. (?<somename>pattern) is used to attach a name to the group.

Extract both names and values

You can also capture the field values and extract them as a separate field. Once you have the field name and value, you can return them, eg as an object or anonymous type.

The pattern in this case is more complex:

@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)"

The parentheses around the value are escaped because we want to match the literal ( and ) characters in the input. Both ' and " characters are used around values, so ['"] is used to specify a choice of characters. The pattern is a verbatim string literal (ie it starts with @), so the double quote has to be escaped by doubling it: ['""] . Any character has to be matched with .+ but only up to the next character in the pattern, hence .+?. Without the ? the pattern .+ would match as much as possible and run on to the last closing quote and parenthesis in the input.
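
The difference between .+ and .+? can be seen in a small sketch. The input here is a shortened version of the question's string, used only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened sample input with two fields (an assumption for this demo).
var sample = "($name$:('George') AND $phonenumer$:('456456'))";

// Greedy: .+ runs to the end of the input, then backs off to the LAST quote followed by ')'.
var greedy = Regex.Match(sample, @"\$(?<name>\w+)\$:\(['""](?<value>.+)['""]\)");

// Lazy: .+? stops at the FIRST quote followed by ')'.
var lazy = Regex.Match(sample, @"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");

Console.WriteLine(greedy.Groups["value"].Value); // George') AND $phonenumer$:('456456
Console.WriteLine(lazy.Groups["value"].Value);   // George
```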

Putting this together:

var regex =  new Regex(@"\$(?<name>\w+)\$:\(['""](?<value>.+?)['""]\)");
var myValues = regex.Matches(input)
          .OfType<Match>()
          .Select(m=>new {  Name=m.Groups["name"].Value, 
                            Value=m.Groups["value"].Value
            })
          .ToArray();

Turn them into a dictionary

Instead of ToArray() you could convert the objects to a dictionary with ToDictionary(), eg with .ToDictionary(it=>it.Name,it=>it.Value). You could omit the select step and generate the dictionary from the matches themselves :

var myDict = regex.Matches(input)
          .OfType<Match>()
          .ToDictionary(m=>m.Groups["name"].Value, 
                        m=>m.Groups["value"].Value);

Regular expressions are generally fast because they don't split the string. The pattern is converted to efficient code that parses the input and skips non-matching input immediately. Each match and group contains only the indexes of its starting and ending characters in the input string. A string is only generated when .Value is called.

Regex objects are thread-safe, which means a single Regex object can be stored in a static field and reused from multiple threads. That helps in web applications, as there's no need to create a new Regex object for each request.
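
A minimal sketch of the static-field approach follows. FieldParser and GetNames are hypothetical names introduced here for illustration, and RegexOptions.Compiled is an optional choice, not something the approach depends on:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// FieldParser is a hypothetical helper class, not part of the original answer.
public static class FieldParser
{
    // Created once and shared by every caller; a Regex instance is immutable,
    // so calling Matches on it from several threads is safe.
    private static readonly Regex FieldRegex =
        new Regex(@"\$(?<name>\w+)\$", RegexOptions.Compiled);

    // Returns the field names found between $ characters.
    public static List<string> GetNames(string input) =>
        FieldRegex.Matches(input)
                  .OfType<Match>()
                  .Select(m => m.Groups["name"].Value)
                  .ToList();
}
```

Calling FieldParser.GetNames(input) with the question's string should give the same three names as the earlier example.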

Because of these two advantages, regular expressions are used extensively to parse log files and extract specific fields. Compared to splitting, performance can be 10 times better or more, while memory usage remains low. Splitting can easily result in memory usage that's multiple times bigger than the original input file.

Can it go faster?

Yes. Regular expressions produce parsing code that may not be as efficient as possible. A hand-written parser could be faster. In this particular case, we want to start capturing text when a $ is detected and stop at the next $. This can be done with the following method:

IEnumerable<string> GetNames(string input)
{
    var builder=new StringBuilder(20);
    bool started=false;
    foreach(var c in input)
    {        
        if (started)
        {
            if (c!='$')
            {
                builder.Append(c);
            }
            else
            {
                started=false;
                var value=builder.ToString();
                yield return value;
                builder.Clear();
            }
        }
        else if (c=='$')
        {
            started=true;
        }        
    }
}

A string is an IEnumerable<char> so we can inspect one character at a time without having to copy them. By using a single StringBuilder with a predetermined capacity we avoid reallocations, at least until we find a key that's larger than 20 characters.

Modifying this code to extract values though isn't so easy.

littlerufe · Accepted Answer · 2017-07-13 16:38:01Z

Here's one way to do it, but certainly not very elegant. Basically splitting the string on the '$' and taking every other item will give you the result (after some additional trimming of unwanted characters).

In this example, I'm also grabbing the value of each item and then putting both in a dictionary:

var input = "($name$:('George') AND $phonenumer$:('456456') AND $emailaddress$:(\"[email protected]\"))";
var inputParts = input.Replace(" AND ", "")
    .Trim(')', '(')
    .Split(new[] {'$'}, StringSplitOptions.RemoveEmptyEntries);

var keyValuePairs = new Dictionary<string, string>();

for (int i = 0; i < inputParts.Length - 1; i += 2)
{
    var key = inputParts[i];
    var value = inputParts[i + 1].Trim('(', ':', ')', '"', '\'', ' ');

    keyValuePairs[key] = value;
}

foreach (var kvp in keyValuePairs)
{
    Console.WriteLine($"{kvp.Key} = {kvp.Value}");
}

// Wait for input before closing
Console.WriteLine("\nDone!\nPress any key to exit...");
Console.ReadKey();

Output

name = George
phonenumer = 456456
emailaddress = [email protected]

Done!
Press any key to exit...
