2

I'm new to VB and C#, and I'm struggling with regex. I think the following code is roughly the right shape for replacing each regex match in my file with an empty string.

EDIT: Per comments this code block has been changed.

var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

// pattern is the regex I still need to work out (see below)
var regex = new Regex(pattern);
fileContents = regex.Replace(fileContents, "");
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);

My files are formatted like this:

"1111111","22222222222","Text that may, have a comma, or two","2014-09-01",,,,,,

So far, my regex finds any string between ," and ", that contains a comma. (There are never commas in the first or last cell, so I'm not worried about excluding those two.) I'm testing the regex in Expresso:

(?<=,")([^"]+,[^"]+)(?=",)

I'm just not sure how to isolate that comma as what needs to be replaced. What would be the best way to do this?

SOLVED: Combined [^"]+ with look behind/ahead:

(?<=,"[^"]+)(,)(?=[^"]+",)

FINAL EDIT: Here's my final complete solution:

//read file contents
var fileContents = System.IO.File.ReadAllText(@"C:\path\to\file.csv");

//find quoted fields that contain at least one comma
//(requires: using System.Text.RegularExpressions;)
var regex = new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");

//replace all commas with ""
fileContents = regex.Replace(fileContents, m => m.ToString().Replace(",", ""));

//write result back to file
System.IO.File.WriteAllText(@"C:\path\to\file.csv", fileContents);
11
  • Same question for Java: stackoverflow.com/questions/1757065/… Commented Dec 10, 2014 at 17:30
  • Filecontents.Replace does not regex replace for starters. You create a Regex regex = new Regex(pattern); then you do regex.Replace(filecontents, replacement); Commented Dec 10, 2014 at 17:30
  • @DStanley I'm not trying to split the string Commented Dec 10, 2014 at 17:33
  • @FlorianSchmidinger thanks for that explanation, I'll try it that way, but I still need to figure out the correct regex Commented Dec 10, 2014 at 17:33
  • 1
    @RichardN - When you use that regex, it only finds the single character to be replaced. The match evaluator delegate is an expensive callback whose primary purpose is to do a sub-replacement within a more general match. Using the same regex, try this: Console.WriteLine(Regex.Replace(@",""one, two"",", "(?<=,\"[^\"]+),(?=[^\"]+\",)", "")); and then this: Console.WriteLine(Regex.Replace(@",""one, two"",", "(?<=,\"[^\"]+),(?=[^\"]+\",)", m => m.ToString().Replace(",", ""))); Commented Dec 10, 2014 at 22:34

4 Answers

1

Figured it out by combining [^"]+ with the lookahead (?=...) and the lookbehind (?<=...). The pattern matches a single comma that is preceded by ," plus one or more characters that are not double quotes, and followed by one or more characters that are not double quotes plus ",

(?<=,"[^"]+)(,)(?=[^"]+",)
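As a quick check of this pattern, here is a minimal sketch that applies it to text in the format from the question. The class and method names (CommaStripper, StripQuotedCommas) are made up for illustration, and it assumes fields never contain double quotes.

```csharp
using System.Text.RegularExpressions;

public static class CommaStripper
{
    // Matches a single comma that sits between ," and ", with no double
    // quote on either side of it inside the field. This relies on .NET
    // supporting variable-length lookbehind.
    private static readonly Regex QuotedComma =
        new Regex("(?<=,\"[^\"]+),(?=[^\"]+\",)");

    // Hypothetical helper: removes every comma found inside a quoted field
    // and leaves the separating commas alone.
    public static string StripQuotedCommas(string csvText)
    {
        return QuotedComma.Replace(csvText, "");
    }
}
```

Calling CommaStripper.StripQuotedCommas on the sample row from the question should turn the third field into "Text that may have a comma or two" while the other fields and the trailing commas stay as they are.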


3 Comments

This works ok. You could even use (?<=,"[^"]*),(?=[^"]*",) to handle edge cases like delimiter",middle,"delimiter. +1
Yeah, I guess that would work too. It will never happen, though, as the files I'm dealing with are auto-generated in a specific format, and the , inside a field only appears in numbers such as 10,000 or 1,000,000. I guess I could even use (?<=[0-9]+),(?=[0-9]+)
There you go, that makes sense.
1

Try to parse out all your columns with this:

 // match each whole quoted field, quotes included
 Regex regex = new Regex("\"[^\"]*\"");

Then you can just do:

 foreach(Match match in regex.Matches(fileContents))
 {
      fileContents = fileContents.Replace(match.ToString(), match.ToString().Replace(",",string.Empty));
 }

Might not be as fast but should work.

0

I would probably use the overload of Regex.Replace that takes a delegate to return the replaced text. This is useful when you have a simple regex to identify the pattern but you need to do something less straightforward (complex logic) for the replace.

I find keeping your regexes simple will pay benefits when you're trying to maintain them later.

Note: this is similar to the answer by @Florian, but this replace restricts itself to replacement only in the matched text.

string exp = "(?<=,\")([^\"]+,[^\"]+)(?=\",)";
var regex = new Regex(exp); 
string replacedtext = regex.Replace(fileContents, m => m.ToString().Replace(",",""));
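To make the delegate overload a little more concrete, here is a small sketch that wraps the same expression in a helper. The names DelegateReplaceDemo and RemoveCommasInQuotedFields are hypothetical, and the sketch assumes the fields contain no double quotes.

```csharp
using System.Text.RegularExpressions;

public static class DelegateReplaceDemo
{
    // Simple pattern: a whole field between ," and ", that contains a comma.
    private static readonly Regex FieldWithComma =
        new Regex("(?<=,\")([^\"]+,[^\"]+)(?=\",)");

    // Hypothetical helper: the regex only locates the field; the delegate
    // does the actual clean-up, and only on the matched text.
    public static string RemoveCommasInQuotedFields(string csvText)
    {
        return FieldWithComma.Replace(csvText, m => m.Value.Replace(",", ""));
    }
}
```

For an input such as "1","10,000","1,000,000","end", this should leave the separators alone and rewrite the two numeric fields as "10000" and "1000000".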

1 Comment

would input in this case be filecontents?
0

What you have there is context-dependent text: a comma means different things depending on where it is in the text stream (a field separator outside quotes, ordinary data inside them). Regular expressions are most comfortable when a character means the same thing regardless of where it appears. For input like this, a small parser that tracks whether it is inside quotes is a better fit. In fact, regular expressions are mostly used for tokenizing strings before they are passed to a parser.

While what you are trying to do can be done using regular expressions, it is likely to be slow. For example, you can use the following (which works even if the comma is the first or last character in the field). However, every time it finds a comma, it has to scan backwards and forwards to check whether it is between two quotation marks.

 (?<=,"[^"]*),(?=[^"]*",)

Note also that there may be a flaw in this approach that you have not yet spotted. I don't know if you have this issue, but CSV files often have quotation characters in the middle of fields that may also contain a comma. In these cases, applications like MS Excel will typically double the quote to show that it is not the end of the field, like this:

"1111111","22222222222","Text that may, have a comma, Quote"" or two","2014-09-01",,,,,,

In this case you are going to be out of luck with a regular expression.
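A small sketch can show what happens with this particular pattern. The class and method names here (DoubledQuoteDemo, StripQuotedCommas) are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class DoubledQuoteDemo
{
    // The same lookbehind/lookahead pattern as above, using * instead of +.
    private static readonly Regex QuotedComma =
        new Regex("(?<=,\"[^\"]*),(?=[^\"]*\",)");

    // Hypothetical helper: tries to strip commas inside quoted fields.
    public static string StripQuotedCommas(string csvLine)
    {
        return QuotedComma.Replace(csvLine, "");
    }
}
```

Passing the row with the doubled quote through StripQuotedCommas returns it unchanged. The lookahead stops at the first of the two quotes, finds another quote where it expects a comma, and so the commas in that field are never matched.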

Thankfully the code to deal with CSV files is very simple:

    public static IList<string> ParseCSVLine(string csvLine)
    {
        List<string> result = new List<string>();
        StringBuilder buffer = new StringBuilder();

        bool inQuotes = false;
        char lastChar = '\0';

        foreach (char c in csvLine)
        {
            switch (c)
            {
                case '"':
                    if (inQuotes)
                    {
                        inQuotes = false;
                    }
                    else
                    {
                        if (lastChar == '"')
                        {
                            buffer.Append('"');
                        }
                        inQuotes = true;
                    }
                    break;

                case ',':
                    if (inQuotes)
                    {
                        buffer.Append(',');
                    }
                    else
                    {
                        result.Add(buffer.ToString());
                        buffer.Clear();
                    }
                    break;

                default:
                    buffer.Append(c);
                    break;
            }

            lastChar = c;
        }
        result.Add(buffer.ToString());
        buffer.Clear();

        return result;
    }

PS. There are a couple of other issues often run into with CSV files which the code I have given doesn't solve. The first is what happens if a field has an end-of-line character in the middle of it. The second is how you know what character encoding a CSV file is in. The first of these is easy to solve by modifying my code slightly. The second, however, is nearly impossible to solve without coming to some agreement with the person supplying the file to you.
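For the first issue, one possible modification is to walk the whole file text instead of a single line, so that a line break inside quotes stays part of the field. The following is only a sketch under that assumption; CsvTextParser and ParseCsvText are hypothetical names, and it does nothing about character encoding.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvTextParser
{
    // Hypothetical variant of ParseCSVLine that parses the whole text, so a
    // line break inside a quoted field does not end the row.
    public static List<List<string>> ParseCsvText(string csvText)
    {
        var rows = new List<List<string>>();
        var row = new List<string>();
        var buffer = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < csvText.Length; i++)
        {
            char c = csvText[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < csvText.Length && csvText[i + 1] == '"')
                {
                    buffer.Append('"'); // doubled quote is a literal quote
                    i++;                // skip the second quote of the pair
                }
                else if (c == '"')
                {
                    inQuotes = false;   // closing quote
                }
                else
                {
                    buffer.Append(c);   // commas and line breaks are data here
                }
            }
            else if (c == '"')
            {
                inQuotes = true;        // opening quote
            }
            else if (c == ',')
            {
                row.Add(buffer.ToString());
                buffer.Clear();
            }
            else if (c == '\n')
            {
                // An unquoted line break ends the current field and row.
                row.Add(buffer.ToString());
                buffer.Clear();
                rows.Add(row);
                row = new List<string>();
            }
            else if (c != '\r')
            {
                buffer.Append(c);       // ignore CR outside quotes
            }
        }

        // Flush the last row when the text does not end with a line break.
        if (buffer.Length > 0 || row.Count > 0)
        {
            row.Add(buffer.ToString());
            rows.Add(row);
        }
        return rows;
    }
}
```

The design choice here is to look one character ahead for a doubled quote while inside a quoted field, rather than remembering the previous character, which keeps the quote handling in one place.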

1 Comment

Thanks for all the details here. It was very informative. Just to clarify, my regex is (?<=,"[^"]+),(?=[^"]+",), using + instead of * so that it requires one or more characters on each side of the comma, between the ," and the ",
