python csv module read csv split by comma but ignore the comma inside double or single quotes

Question

I have a .csv file in which some column values contain commas. Here are some examples:

Header: ID     Value           Content                                            Date
        1      34             "market, business"                               12/20/2013
        2      15             "market, business", yesterday, metric            11/21/2014
        3      18             "market," business and yesterday                 10/20/2014
        4      19              yesterday, today,                               11/22/2014

That is the logical layout of the .csv file. When I open it in Sublime Text, it looks like this:

1, 34, "market, business", 12/20/2013
2, 15, "market, business", "yesterday, metric, 11/21/2014
3, 18, "market," business and yesterday, 10/20/2014
4, 19, yesterday, today, 11/22/2014

What I want to get from the Python csv reader is:

[1, 34, "market, business", 12/20/2013]
[2, 15, "market, business" "yesterday metric, 11/21/2014]
[3, 18, "market," business and yesterday, 10/20/2014]
[4, 19, yesterday today, 11/22/2014]

This is just sample data. The "Content" column is the headache here, because the csv module uses "," as the separator. I used

reader = csv.reader(f, skipinitialspace=True)

This works for the first row, where the whole string is inside one pair of double quotes. But it does not work for the second and third rows, where there are commas outside the quotes (single or double).
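A minimal sketch of that behavior, using two of the sample lines above. The in-memory list of lines stands in for the real file, which is an assumption made to keep the example self-contained:

```python
import csv

# Two of the sample lines from the question, held in memory instead of a file
# (an assumption so the sketch runs on its own).
lines = [
    '1, 34, "market, business", 12/20/2013',
    '4, 19, yesterday, today, 11/22/2014',
]

rows = list(csv.reader(lines, skipinitialspace=True))

for row in rows:
    print(len(row), row)

# The quoted comma in row 1 is protected, so that row has 4 fields.
# The unquoted comma in row 4 is treated as a separator, so that row has 5.
```

The reader has no way to know that the comma between "yesterday" and "today" belongs to the Content column, because nothing in the line marks it as different from a separator.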

How can I solve this? I'm only using the standard csv module in Python right now. Can pandas handle this?

Thanks.

Update: I think what I want is a way to treat commas differently depending on where they appear. Now that I have written it out, it seems unreasonable, because I can't find any way in the csv module to tell a separator "," from a "," inside a field. Even Excel can't...

Any ideas?

Look at the list of "Related Questions" to the right. Do any of these answer your question?
– kdopen, Commented Dec 18, 2014 at 19:24
The desired Python lists would raise SyntaxErrors because there are unmatched quotation marks and strings without any quotation marks. Please fix.
– unutbu, Commented Dec 18, 2014 at 19:56
If all your records have only 4 fields (fixed), there is a trivial way.
– Bhargav Rao, Commented Dec 18, 2014 at 20:02

unutbu · Accepted Answer · 2014-12-18 20:06:26Z

If we can assume

each line begins with two ints separated by commas,
each line ends with a date, preceded by a comma, and
everything remaining (in the middle) belongs in the third column,

then your data could be parsed this way:

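data = []
with open('data') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which have nothing to split
        # Split off the two leading int columns; the rest stays in parts[2].
        parts = line.split(',', 2)
        # Split the date off the end; what is left is the Content column.
        parts[2:4] = parts[2].rsplit(',', 1)
        parts[:2] = map(int, parts[:2])
        # Trim surrounding whitespace (and the newline) from the text fields.
        parts[2:] = map(str.strip, parts[2:])
        data.append(parts)

for row in data:
    print(row)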

yields

[1, 34, '"market, business"', '12/20/2013']
[2, 15, '"market, business", "yesterday, metric', '11/21/2014']
[3, 18, '"market," business and yesterday', '10/20/2014']
[4, 19, 'yesterday, today', '11/22/2014']
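The same split-from-both-ends idea can be wrapped in a function that also checks the assumptions listed above. This is only a sketch: `parse_line` is a hypothetical helper name, and converting the last field with the `%m/%d/%Y` format is an assumption based on the sample dates, not something the original code does.

```python
from datetime import datetime


def parse_line(line):
    """Split 'int, int, free text, MM/DD/YYYY' into four typed fields.

    parse_line is a hypothetical helper, not part of any library.
    """
    # The first two commas separate the leading int columns.
    head = line.split(',', 2)
    if len(head) < 3:
        raise ValueError(f'too few fields: {line!r}')
    # The last comma separates the free-text column from the date.
    tail = head[2].rsplit(',', 1)
    if len(tail) < 2:
        raise ValueError(f'no date field: {line!r}')
    content, date_text = tail
    return [
        int(head[0]),
        int(head[1]),
        content.strip(),
        # Assumed date layout, taken from the sample rows.
        datetime.strptime(date_text.strip(), '%m/%d/%Y').date(),
    ]


row = parse_line('3, 18, "market," business and yesterday, 10/20/2014')
print(row)
```

A line that breaks the assumptions (too few commas, a non-integer ID, a malformed date) raises `ValueError` instead of silently producing a misaligned row.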

You could then make a DataFrame like this:

import pandas as pd
df = pd.DataFrame(data, columns=['Id','Value','Content','Date'])
print(df)

yields

   Id  Value                                 Content        Date
0   1     34                      "market, business"  12/20/2013
1   2     15  "market, business", "yesterday, metric  11/21/2014
2   3     18        "market," business and yesterday  10/20/2014
3   4     19                        yesterday, today  11/22/2014
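If the cleaned rows need to be saved again, one option (not part of the approach above) is to write them with `csv.writer`. It quotes any field that contains a comma and doubles embedded quote characters, so a standard reader gets four columns back. A sketch, using `io.StringIO` as a stand-in for a real file and two of the rows from the output above:

```python
import csv
import io

# Two parsed rows, copied from the output shown earlier.
data = [
    [1, 34, '"market, business"', '12/20/2013'],
    [3, 18, '"market," business and yesterday', '10/20/2014'],
]

# io.StringIO stands in for a real output file (an assumption for the sketch).
buffer = io.StringIO()
csv.writer(buffer).writerows(data)

# Read the text back with the default reader settings.
buffer.seek(0)
round_trip = list(csv.reader(buffer))
for row in round_trip:
    print(row)
```

The Content values survive the round trip unchanged, including their literal quote characters. The two numeric columns come back as strings, since csv.reader does not convert types.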
