C# Replace with regex

Question

I'm new to VB and C#, and am struggling with regex. I think I have the right code structure below to replace each regex match in my file with an empty string.

EDIT: Per comments this code block has been changed.

var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

~~fileContents = fileContents.Replace(fileContents, @"regex", "");~~

var regex = new Regex(pattern);
fileContents = regex.Replace(fileContents, "");
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

My files are formatted like this:

"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,

So far, I have a regex that finds any string between ," and ", that contains a comma (there are never commas in the first or last cell, so I'm not worried about excluding those two). I'm testing the regex in Expresso:

(?<=,")([^"]+,[^"]+)(?=",)

I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?

SOLVED: Combined [^"]+ with look behind/ahead:

(?<=,"[^"]+)(,)(?=[^"]+",)

FINAL EDIT: Here's my final complete solution:

//read file contents
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

//find all commas between double quotes
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");

//replace all commas with ""
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));

//write result back to file
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

Same question for Java: stackoverflow.com/questions/1757065/…
– D Stanley, Commented Dec 10, 2014 at 17:30
Filecontents.Replace does not do a regex replace, for starters. You create a Regex regex = new Regex(pattern); then you call regex.Replace(filecontents, replacement);
– Florian Schmidinger, Commented Dec 10, 2014 at 17:30
@FlorianSchmidinger thanks for that explanation, I'll try it that way, but I still need to figure out the correct regex
– KingRichard, Commented Dec 10, 2014 at 17:33
@RichardN - When you use that regex, it only finds a single character, which it replaces. The match evaluator delegate is an expensive callback whose primary purpose is to do a sub-replacement within a more general replacement. Using the same regex, try this: Console.WriteLine(Regex.Replace(@",""one, two"",", "(?<=,\"[^\"]+),(?=[^\"]+\",)", "")); then this: Console.WriteLine(Regex.Replace(@",""one, two"",", "(?<=,\"[^\"]+),(?=[^\"]+\",)", m => m.ToString().Replace(",", "")));
– user557597, Commented Dec 10, 2014 at 22:34

KingRichard · Accepted Answer · 2014-12-10 17:50:52Z

1

Figured it out by combining [^"]+ with the lookahead ?= and lookbehind ?<= so that it matches only a comma that is preceded by ,"[anything that's not a double quote, one or more times] and followed by [anything that's not a double quote, one or more times]",

(?<=,"[^"]+)(,)(?=[^"]+",)
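As a rough sketch of how this pattern could be applied to the sample line from the question: because the regex matches only the comma itself, a plain replacement with an empty string is enough and no match evaluator is needed. The class name `CommaStripper` and the helper `StripInnerCommas` are made up for illustration, and the capture group around the comma is dropped because nothing reads it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaStripper
{
    // Matches a comma only when it is preceded by ," plus one or more
    // non-quote characters and followed by one or more non-quote
    // characters plus ",  (.NET allows variable-length lookbehind).
    private static readonly Regex InnerComma =
        new Regex("(?<=,\"[^\"]+),(?=[^\"]+\",)");

    // Hypothetical helper: removes commas that sit inside a quoted field.
    public static string StripInnerCommas(string line)
    {
        return InnerComma.Replace(line, "");
    }

    public static void Main()
    {
        string line = "\"1111111\",\"22222222222\",\"Text that may, have a comma, or two\",\"2014-09-01\",,,,,,";
        Console.WriteLine(StripInnerCommas(line));
        // prints: "1111111","22222222222","Text that may have a comma or two","2014-09-01",,,,,,
    }
}
```

The separator commas are left alone because each of them is directly preceded by a closing quote or by another separator, so the lookbehind cannot match there.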

answered Dec 10, 2014 at 17:50

KingRichard


3 Comments

user557597 Over a year ago

This works ok. You could even use (?<=,"[^"]*),(?=[^"]*",) to handle edge cases like delimiter",middle,"delimiter. +1

KingRichard Over a year ago

Yeah, I guess that would work too. It will never happen, as the files I'm dealing with are auto-generated in a specific format; the , inside the field only appears in numbers such as 10,000 or 1,000,000. I guess I could even use (?<=[0-9]+),(?=[0-9]+)

user557597 Over a year ago

There you go, that makes sense.

Oggy · Accepted Answer · 2017-02-17 19:58:28Z

1

Try to parse out all your columns with this:

 Regex regex = new Regex("\"[^\"]*\"");

Then you can just do:

 foreach (Match match in regex.Matches(fileContents))
 {
      // Swap each matched text for a copy of itself with the commas removed
      fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",", string.Empty));
 }

Might not be as fast but should work.

edited Feb 17, 2017 at 19:58

Oggy


answered Dec 10, 2014 at 17:49

Florian Schmidinger


Mark Peters · Accepted Answer · 2014-12-10 18:05:29Z

0

I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text. This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.

I find keeping your regexes simple will pay benefits when you're trying to maintain them later.

Note: this is similar to the answer by @Florian, but this replace restricts itself to replacement only in the matched text.

string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp);
// The delegate receives each matched field and returns it with its commas removed
string replacedtext = regex.Replace(filecontents, m => m.ToString().Replace(",", ""));

edited Dec 10, 2014 at 18:05

answered Dec 10, 2014 at 18:00

Mark Peters


1 Comment

KingRichard Over a year ago

would input in this case be filecontents?

Martin Brown · Accepted Answer · 2014-12-10 18:27:11Z

What you have there is an irregular language, because a comma can mean different things depending on where it is in the text stream. Regular expressions, as the name suggests, are designed to parse regular languages, where a comma would mean the same thing regardless of where it appears. What you need for an irregular language is a parser. In fact, regular expressions are mostly used for tokenizing strings before they are passed to a parser.

While what you are trying to do can be done using regular expressions, it is likely to be very slow. For example, you can use the following (which works even if the comma is the first or last character in the field). However, every time it finds a comma, it has to scan backwards and forwards to check whether it is between two quotation characters.

 (?<=,"[^"]*),(?=[^"]*",)

Note also that there may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue, but CSV files often have quotation characters in the middle of fields that may also contain a comma. In these cases, applications like MS Excel will typically double the quote to show that it is not the end of the field. Like this:

"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,

In this case you are going to be out of luck with a regular expression.

Thankfully the code to deal with CSV files is very simple:

    // Requires: using System.Collections.Generic; using System.Text;
    public static IList<string> ParseCSVLine(string csvLine)
    {
        List<string> result = new List<string>();
        StringBuilder buffer = new StringBuilder();

        bool inQuotes = false;
        char lastChar = '\0';

        foreach (char c in csvLine)
        {
            switch (c)
            {
                case '"':
                    if (inQuotes)
                    {
                        inQuotes = false;
                    }
                    else
                    {
                        if (lastChar == '"')
                        {
                            buffer.Append('"');
                        }
                        inQuotes = true;
                    }
                    break;

                case ',':
                    if (inQuotes)
                    {
                        buffer.Append(',');
                    }
                    else
                    {
                        result.Add(buffer.ToString());
                        buffer.Clear();
                    }
                    break;

                default:
                    buffer.Append(c);
                    break;
            }

            lastChar = c;
        }
        result.Add(buffer.ToString());
        buffer.Clear();

        return result;
    }
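If the goal is only to drop the commas inside quoted fields rather than to split the line into fields, the same quote-tracking idea can be sketched as a single pass over the text. This is a minimal sketch, not part of the parser above: the names `QuoteAwareCommaRemover` and `RemoveCommasInsideQuotes` are hypothetical, and it assumes that quotes are only ever escaped by doubling them.

```csharp
using System;
using System.Text;

public static class QuoteAwareCommaRemover
{
    // Hypothetical helper: copies the line, skipping commas that appear
    // while we are inside a quoted field. A doubled quote ("") toggles the
    // flag twice, so the scanner is still inside the field afterwards.
    public static string RemoveCommasInsideQuotes(string csvLine)
    {
        var result = new StringBuilder(csvLine.Length);
        bool inQuotes = false;

        foreach (char c in csvLine)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes;
                result.Append(c);
            }
            else if (c == ',' && inQuotes)
            {
                // Comma inside a quoted field: drop it.
            }
            else
            {
                result.Append(c);
            }
        }

        return result.ToString();
    }

    public static void Main()
    {
        string line = "\"1111111\",\"22222222222\",\"Text that may, have a comma, Quote\"\" or two\",\"2014-09-01\",,,,,,";
        Console.WriteLine(RemoveCommasInsideQuotes(line));
        // prints: "1111111","22222222222","Text that may have a comma Quote"" or two","2014-09-01",,,,,,
    }
}
```

Unlike the lookaround regex, this visits each character once and leaves the doubled quote untouched.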

PS. There are a couple of other issues often run into with CSV files that the code I have given doesn't solve. First, what happens if a field has an end-of-line character in the middle of it? Second, how do you know what character encoding a CSV file is in? The first of these is easy to solve by modifying my code slightly. The second, however, is nearly impossible to solve without coming to some agreement with the person supplying the file to you.

Thanks for all the details here. It was very informational. Just to clarify, my regex is (?<=,"[^"]+),(?=[^"]+",) using + instead of * so that it requires one or more chars between the ," and ,
