I have a CSV file with the following kind of data:

0,'VT,C',0,
0,'C,VT',0,
0,'VT,H',0,

and I want the following output:

0
VT,C
0
0
C,VT
0
0
VT,H
0

In other words, I want to split each line on commas while ignoring any comma inside quote marks. At the moment I'm using the following regex:

("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)"

However, this gives me the following result:

0
VT
C
0
0
C
VT
0
0
VT
H
0

This shows that the regex is not handling the quote marks properly. Can anyone suggest some alterations that might help?

4 Answers

When it comes to CSV parsing, people usually use a dedicated library suited to the programming language their application is written in.
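To illustrate what such a parser does at its core, here is a minimal hand-written sketch, not a library API. It assumes single quotes delimit fields, that quotes are never escaped or nested, and that a trailing comma does not start a new field. The class and method names (`QuoteAwareSplitter`, `Split`) are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    // Hypothetical helper: splits one line on commas that sit outside quotes.
    // Assumption: quotes are never escaped or nested.
    public static List<string> Split(string line, char quote = '\'')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in line)
        {
            if (c == quote)
            {
                // Toggle the state and drop the quote character itself.
                inQuotes = !inQuotes;
            }
            else if (c == ',' && !inQuotes)
            {
                // A comma outside quotes ends the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // Add the last field only if it is non-empty, so the trailing
        // comma in the sample lines does not produce an extra entry.
        if (current.Length > 0)
        {
            fields.Add(current.ToString());
        }

        return fields;
    }
}
```

For the sample line `0,'VT,C',0,` this yields the three fields `0`, `VT,C` and `0`.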

That said, if you want a regular expression for really loose(!) parsing, you could try something like this:

'(?<value>[^']*?)'

It matches anything between single quotes and, assuming the CSV file is well formed, it will not miss a field. It doesn't accept embedded quotes, but it gets the job done, and it's what I use when I need a quick result. Please don't consider it a complete solution to your problem: it only works under special conditions, when the requirements are what you described and the input is well formed.

[EDIT]

Rereading your question, I noticed you also want to include unquoted fields. In that case my expression will not work at all. If you think hard about the problem, you'll find it is quite difficult to solve without ambiguity: you need fixed rules, and if you allow both quoted and unquoted fields, the parser will have a hard time telling whether a comma is a separator or part of a quoted value.

Another expression to model such a solution may be:

('[^']+'|[^,]+),?

It will match both quoted and unquoted fields, although I'm not sure whether it needs to assume the CSV adheres to strict conditions. As far as I can tell, that is much safer than a split strategy: you just need to collect all matches and append each matched value + \r\n to your target string.
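A minimal sketch of that "collect all matches" approach, assuming one line of input at a time. The class and method names (`FieldMatcher`, `Fields`) are made up for illustration, and stripping the surrounding quotes with `Trim` is an addition that the expression itself does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldMatcher
{
    // The pattern above: a quoted field or a run of non-comma characters,
    // followed by an optional separator.
    private static readonly Regex Field = new Regex(@"('[^']+'|[^,]+),?");

    // Hypothetical helper: returns the fields of one line,
    // with the surrounding single quotes removed.
    public static List<string> Fields(string line)
    {
        var result = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            // Group 1 holds the field without the trailing comma.
            result.Add(m.Groups[1].Value.Trim('\''));
        }
        return result;
    }
}
```

Calling `string.Join("\r\n", FieldMatcher.Fields("0,'VT,C',0,"))` gives `0`, `VT,C` and `0` on separate lines. Note that empty fields are skipped, because both alternatives require at least one character.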


This regex relies on the fact that you have one digit before and after your quoted value:

Regex.Replace(input, @"(?:(?<=\d),|,(?=\d))", "\n");

You can test it out on RegexStorm
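As a rough sketch of how the replacement could be turned into a list of fields, assuming one input line at a time: the wrapper below (`DigitBoundarySplit`, a made-up name) runs the same `Regex.Replace`, splits on the inserted newlines, and strips the single quotes, which the replacement alone leaves in place.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DigitBoundarySplit
{
    // Hypothetical helper built around the Regex.Replace call above.
    public static string[] Split(string line)
    {
        // Replace only commas that have a digit directly before or after them.
        string replaced = Regex.Replace(line, @"(?:(?<=\d),|,(?=\d))", "\n");

        // The trailing comma also becomes a newline, so drop empty entries.
        string[] parts = replaced.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Trim('\'');
        }
        return parts;
    }
}
```

For `0,'VT,C',0,` this returns `0`, `VT,C` and `0`. A quoted value with a digit next to its inner comma, such as `'V2,C'`, would still be split, so this only suits data shaped like the samples.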

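foreach (Match m in Regex.Matches(s, "(('.*?')|[0-9])"))
    Console.WriteLine(m.Value.Trim('\'')); // s is one CSV line; prints each field without its quotes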


I have managed to get the following method to read the file as required:

public List<string> SplitCSV(string input, List<string> line)
    {

        Regex csvSplit = new Regex("(([^,^\'])*(\'.*\')*([^,^\'])*)(,|$)", RegexOptions.Compiled);

        foreach (Match match in csvSplit.Matches(input))
        {
            line.Add(match.Value.TrimStart(','));
        }
        return line; 
    }

Thanks for everyone's help, though.

2 Comments

Actually, that doesn't compile, because you should add the value to the list, and you should use TrimEnd(), not TrimStart(). It uses the strategy I suggested, but with a different regular expression. Your expression doesn't handle cases that differ from your samples above, which is why I wrote a more general expression. Anyway, you seem to have asked a question and then left the discussion to go your own way. I hope your solution doesn't fail on further cases.
What is input supposed to represent?
