Split string, ignoring delimiter within quotation marks (python)

Question

I would like to split a string on commas, but ignore any comma that is inside quotation marks.

For example:

teststring = '48, "one, two", "2011/11/03"'
teststring.split(",")
['48', ' "one', ' two"', ' "2011/11/03"']

and the output I would like is:

['48', ' "one, two"', ' "2011/11/03"']

Is this possible?

Raymond Hettinger · Accepted Answer · 2011-11-21 08:06:42Z

31

The csv module will work if you set options to handle this dialect:

>>> import csv
>>> teststring = '48, "one, two", "2011/11/03"'
>>> for line in csv.reader([teststring], skipinitialspace=True):
...     print(line)
...
['48', 'one, two', '2011/11/03']
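The same idea can be wrapped in a small helper for splitting a single string. This is only a sketch: the function name `split_quoted` is made up for illustration and is not part of the csv module.

```python
import csv

def split_quoted(text):
    """Split one line on commas, keeping quoted commas inside their field."""
    # csv.reader wants an iterable of lines, so wrap the single string in a list.
    reader = csv.reader([text], skipinitialspace=True)
    # The first (and only) row is the list of fields.
    return next(reader)

print(split_quoted('48, "one, two", "2011/11/03"'))
# ['48', 'one, two', '2011/11/03']
```

Note that the quotes themselves are consumed by the parser, so the result differs slightly from the output requested in the question, which keeps the quotes and the leading spaces.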

answered Nov 21, 2011 at 8:06

Raymond Hettinger


1 Comment

Joël Over a year ago

+1: nice catch, for this skipinitialspace! I tried to understand the csv documentation but could not get the OP input to work :)

David Webb · Accepted Answer · 2011-11-21 10:37:48Z

9

You can use the csv module from the standard library:

>>> import csv
>>> testdata = ['48, "one, two", "2011/11/03"']
>>> testcsv = csv.reader(testdata, skipinitialspace=True)
>>> next(testcsv)
['48', 'one, two', '2011/11/03']

The one thing to watch out for is that csv.reader expects an iterable that yields one string (one line) per iteration. This means that you can't pass a bare string straight to reader(), because it would be iterated character by character, but you can enclose it in a list as above.

You'll have to be careful with the format of your data or tell csv how to handle it. By default the quotes have to come immediately after the comma or the csv module will interpret the field as beginning with a space rather than being quoted. You can fix this using the skipinitialspace option.
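To make the difference concrete, here is a small comparison of the default reader and one created with skipinitialspace=True, run on the string from the question. The variable names are just for illustration.

```python
import csv

teststring = '48, "one, two", "2011/11/03"'

# Default settings: the space after each comma becomes part of the field,
# so the quote is no longer at the start of the field and is treated as data.
default_fields = next(csv.reader([teststring]))

# skipinitialspace=True drops the space after the delimiter, so the quote
# is seen first and the field is parsed as a quoted field.
trimmed_fields = next(csv.reader([teststring], skipinitialspace=True))

print(default_fields)  # ['48', ' "one', ' two"', ' "2011/11/03"']
print(trimmed_fields)  # ['48', 'one, two', '2011/11/03']
```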

edited Nov 21, 2011 at 10:37

answered Nov 21, 2011 at 7:21

David Webb


8 Comments

stema Over a year ago

This does not solve the OP's problem. "one, two" should not be split, because the comma is within the quotes, or am I misinterpreting something? I tried this myself and got the same result as you. Reading the csv docs, I understood that by default it should treat everything inside quotes as one field.

avasal Over a year ago

@Dave webb: Djmac wants "one, two" in a single variable, which is not the case in your answer... he requires output as ['48', ' "one, two"', ' "2011/11/03"'], length = 3; in your case length = 4

David Webb Over a year ago

@stema - Good point! I didn't read the output of my code carefully enough. It turns out the problem is with the sample data. If there is a space after the comma, csv assumes the field starts with that space and that the " is part of the field, i.e. csv does not automatically trim each value. I've fixed the sample data and the code now works. Thanks for pointing this out.

David Webb Over a year ago

@avasal - as noted above, the problem was with the sample data (kind of) rather than the code. Or rather, if you're going to use csv you have to be a bit more careful with your data format. Thanks for the help.

stema Over a year ago

Great, now it's working! Hopefully also for the OP. +1 from me.


jcollado · Accepted Answer · 2011-11-21 08:00:04Z

7

You can use the shlex module to parse your string.

By default, shlex.split will split your string at whitespace characters not enclosed in quotes:

>>> import shlex
>>> teststring = '48, "one, two", "2011/11/03"'
>>> shlex.split(teststring)
['48,', 'one, two,', '2011/11/03']

This doesn't remove the trailing commas from the tokens, but it's close to what you need. However, if you customize the parser to treat the comma as a whitespace character, then you'll get the output that you need:

>>> parser = shlex.shlex(teststring)
>>> parser.whitespace
' \t\r\n'
>>> parser.whitespace += ','
>>> list(parser)
['48', '"one, two"', '"2011/11/03"']

Note: the parser object is used as an iterator to get the tokens one by one. Hence, list(parser) iterates over the parser object and returns the string split where you need it.
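If you would rather have the quotes stripped, as in the csv-based answers, a variation is to create the lexer in POSIX mode and split only on whitespace characters. This is a sketch rather than a drop-in replacement: `split_with_shlex` is a made-up helper name, and because commas are treated as whitespace, consecutive commas collapse and empty fields are lost.

```python
import shlex

def split_with_shlex(text):
    """Split on commas and whitespace, honoring double-quoted sections."""
    # posix=True makes the lexer strip the surrounding quotes from tokens.
    lexer = shlex.shlex(text, posix=True)
    # Treat the comma as one more whitespace (separator) character.
    lexer.whitespace += ','
    # Split only on whitespace, so characters like '/' stay inside a token.
    lexer.whitespace_split = True
    return list(lexer)

print(split_with_shlex('48, "one, two", "2011/11/03"'))
# ['48', 'one, two', '2011/11/03']
```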

answered Nov 21, 2011 at 8:00

jcollado


1 Comment

Raymond Hettinger Over a year ago

This gets the job done, but isn't as good a fit as the csv module.

Mikhail Zakharov · Accepted Answer · 2020-03-30 12:03:14Z

7

As another option, try tssplit. It is not part of the standard library, so you have to install it via pip:

In [5]: from tssplit import tssplit
In [6]: tssplit('48, "one, two", "2011/11/03"', quote='"', delimiter=',', trim=' ')
Out[6]: ['48', 'one, two', '2011/11/03']

edited Mar 30, 2020 at 12:03

answered Mar 30, 2020 at 11:57

Mikhail Zakharov


Acorn · Accepted Answer · 2011-11-21 07:12:09Z

3

You should use the Python csv library: http://docs.python.org/library/csv.html

answered Nov 21, 2011 at 7:12

Acorn


1 Comment

Raymond Hettinger Over a year ago

That link isn't enough to solve the problem. Right out of the box, a csv reader won't correctly parse the OP's test string.

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2014-05-07 15:30:53Z

1

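import re
import shlex

teststring = '48, "one, two", "2011/11/03"'
# shlex.split leaves the trailing commas attached to the tokens
output = shlex.split(teststring)
# strip a single trailing comma from each token
output = [re.sub(r",$", "", w) for w in output]
print(output)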
['48', 'one, two', '2011/11/03']

edited May 7, 2014 at 15:30

A5C1D2H2I1M1N2O1R2T1


answered May 7, 2014 at 15:04

StreetHawk
