
I would like to split a string on a comma, but ignore cases when it is within quotation marks:

for example:

teststring = '48, "one, two", "2011/11/03"'
teststring.split(",")
['48', ' "one', ' two"', ' "2011/11/03"']

and the output I would like is:

['48', ' "one, two"', ' "2011/11/03"']

Is this possible?

6 Answers

31

The csv module will work if you set options to handle this dialect:

>>> import csv
>>> teststring = '48, "one, two", "2011/11/03"'
>>> for line in csv.reader([teststring], skipinitialspace=True):
...     print(line)
...
['48', 'one, two', '2011/11/03']

1 Comment

+1: nice catch, for this skipinitialspace! I tried to understand the csv documentation but could not get the OP input to work :)
9

You can use the csv module from the standard library:

>>> import csv
>>> testdata = ['48, "one, two", "2011/11/03"']
>>> testcsv = csv.reader(testdata,skipinitialspace=True)
>>> next(testcsv)
['48', 'one, two', '2011/11/03']

The one thing to watch out for is that csv.reader expects an iterable that yields one string (one line) per iteration. This means that you can't pass a plain string straight to reader(), because it would be iterated character by character, but you can enclose it in a list as above.
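As a minimal sketch of the point above, the wrapping can be hidden in a small helper. The name split_csv_line is made up for illustration and is not part of the csv module:

```python
import csv

def split_csv_line(line):
    """Hypothetical helper: parse a single CSV-formatted string into fields."""
    # csv.reader needs an iterable of lines, so wrap the one string in a list
    # and pull the first (and only) row out of the reader.
    return next(csv.reader([line], skipinitialspace=True))

print(split_csv_line('48, "one, two", "2011/11/03"'))
# ['48', 'one, two', '2011/11/03']
```

Note that the quotes are consumed by the parser, so the fields come back without them.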

You'll have to be careful with the format of your data or tell csv how to handle it. By default the quotes have to come immediately after the comma or the csv module will interpret the field as beginning with a space rather than being quoted. You can fix this using the skipinitialspace option.
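To make the difference concrete, here is a short comparison of the default behaviour and skipinitialspace=True on the string from the question. The variable names are only illustrative:

```python
import csv

teststring = '48, "one, two", "2011/11/03"'

# Default settings: the space after each comma starts the field, so the
# quote that follows is treated as ordinary data and the inner comma
# still splits the field.
default_fields = next(csv.reader([teststring]))

# skipinitialspace=True drops whitespace directly after the delimiter,
# so the quote is the first character of the field and is honoured.
trimmed_fields = next(csv.reader([teststring], skipinitialspace=True))

print(len(default_fields))  # 4
print(trimmed_fields)       # ['48', 'one, two', '2011/11/03']
```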

8 Comments

This does not solve the OP's problem. "one, two" should not be split, because the comma is within the quotes, or am I misinterpreting something? I tried this on my own and got the same result as you. Reading the csv docs, I understood that by default it should treat everything inside quotes as one field.
@Dave webb: Djmac wants "one, two" in a single variable, which is not the case in your answer. He requires the output ['48', ' "one, two"', ' "2011/11/03"'], which has length 3; in your case the length is 4.
@stema - Good point! I didn't read the output of my code carefully enough. It turns out the problem is with the sample data. If there is a space after the comma, csv treats that space as the start of the field, so the " becomes part of the field's data, i.e. csv does not automatically trim each value. I've fixed the sample data and the code now works. Thanks for pointing this out.
@avasal - as noted above the problem was with the sample data (kind of) rather than the code. Or rather, if you're going to use csv you have to be a bit more careful with your data format. Thanks for the help.
Great, now it's working! Hopefully also for the OP. +1 from me.
7

You can use the shlex module to parse your string.

By default, shlex.split will split your string at whitespace characters not enclosed in quotes:

>>> import shlex
>>> shlex.split(teststring)
['48,', 'one, two,', '2011/11/03']

This doesn't remove the trailing commas from the tokens, but it's close to what you need. However, if you customize the parser to treat the comma as a whitespace character, then you'll get the output that you need:

>>> parser = shlex.shlex(teststring)
>>> parser.whitespace
' \t\r\n'
>>> parser.whitespace += ','
>>> list(parser)
['48', '"one, two"', '"2011/11/03"']

Note: the parser object is used as an iterator to get the tokens one by one. Hence, list(parser) iterates over the parser object and returns the string split where you need it.
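The customized parser can be wrapped in a function, sketched below. The name split_outside_quotes and the optional quote stripping are additions for illustration, not part of shlex. Be aware that in this mode an unquoted field containing characters such as / would be broken into several tokens, so the sketch assumes such fields are quoted, as in the question:

```python
import shlex

def split_outside_quotes(text, delimiter=","):
    """Hypothetical helper: split on delimiter, leaving quoted parts intact."""
    parser = shlex.shlex(text)
    # Treat the delimiter like whitespace so it separates tokens
    # but is ignored inside quotes.
    parser.whitespace += delimiter
    return list(parser)

teststring = '48, "one, two", "2011/11/03"'
tokens = split_outside_quotes(teststring)
print(tokens)
# ['48', '"one, two"', '"2011/11/03"']

# Optionally drop the surrounding double quotes from each token.
unquoted = [token.strip('"') for token in tokens]
print(unquoted)
# ['48', 'one, two', '2011/11/03']
```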

1 Comment

This gets the job done, but isn't as good a fit as the csv module.
7

tssplit is not part of the standard library, so you have to install it via pip, but it is another option to try:

In [5]: from tssplit import tssplit 
In [6]: tssplit('48, "one, two", "2011/11/03"', quote='"', delimiter=',', trim=' ')
Out[6]: ['48', 'one, two', '2011/11/03']


3

You should use the Python csv library: http://docs.python.org/library/csv.html

1 Comment

That link isn't enough to solve the problem. Right out of the box, a csv reader won't correctly parse the OP's test string.
1
import re
import shlex

teststring = '48, "one, two", "2011/11/03"'
output = shlex.split(teststring)
# Strip the trailing comma that shlex.split leaves on each token
output = [re.sub(r",$", "", w) for w in output]
print(output)
['48', 'one, two', '2011/11/03']

