
I have a .csv file in which some column values contain commas. Here are some examples:

Header: ID     Value           Content                                            Date
        1      34             "market, business"                               12/20/2013
        2      15             "market, business", yesterday, metric            11/21/2014
        3      18             "market," business and yesterday                 10/20/2014
        4      19              yesterday, today,                               11/22/2014

That is the layout of the .csv file. When I open it in Sublime Text, it looks like this:

1, 34, "market, business", 12/20/2013
2, 15, "market, business", "yesterday, metric, 11/21/2014
3, 18, "market," business and yesterday, 10/20/2014
4, 19, yesterday, today, 11/22/2014

But what I want from the Python csv reader is:

[1, 34, "market, business", 12/20/2013]
[2, 15, "market, business" "yesterday metric, 11/21/2014]
[3, 18, "market," business and yesterday, 10/20/2014]
[4, 19, yesterday today, 11/22/2014]

This is just sample data. The "Content" column is the headache here, because the csv module uses "," as the separator. I used

reader = csv.reader(f, skipinitialspace=True)

It works for the first row, where the whole field is enclosed in one pair of double quotes. But it doesn't work for the second and third rows, where there are commas outside the quotes (single or double).
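
To make the behaviour concrete, here is a minimal sketch that feeds the sample lines to csv.reader one at a time. Parsing each line in isolation is a simplification (in a real file an unterminated quote can affect the following lines), and parse_alone is a hypothetical helper name, not part of the csv module.

```python
import csv

# Sample lines copied from the Sublime Text view above.
lines = [
    '1, 34, "market, business", 12/20/2013',
    '2, 15, "market, business", "yesterday, metric, 11/21/2014',
    '3, 18, "market," business and yesterday, 10/20/2014',
    '4, 19, yesterday, today, 11/22/2014',
]


def parse_alone(line):
    """Parse a single line in isolation (hypothetical helper)."""
    return next(csv.reader([line], skipinitialspace=True))


rows = [parse_alone(line) for line in lines]
for row in rows:
    # Print the field count next to the parsed fields.
    print(len(row), row)
```

The first row comes back as four fields, while the fourth row is split into five because none of its commas are quoted.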

How can I solve this problem? I'm only using the standard csv module in Python right now. Can pandas solve it?

Thanks.

Update: I think what I need is a way to tell a separator comma apart from a comma inside a field. Now that I've pasted the data here, that seems unreasonable to expect, because I can't find anything in the csv module that distinguishes the separator "," from a "," inside a field. Even Excel can't...

Any ideas?

  • Look at the list of "Related Questions" to the right. Do any of these answer your question? Commented Dec 18, 2014 at 19:24
  • Please post a sample of your csv and the desired DataFrame. Commented Dec 18, 2014 at 19:40
  • The desired Python lists would raise SyntaxErrors because there are unmatched quotation marks and strings without any quotation marks. Please fix. Commented Dec 18, 2014 at 19:56
  • If all your records have only 4 fields (fixed) there is a trivial way Commented Dec 18, 2014 at 20:02
  • @BhargavRao Unfortunately not. Commented Dec 18, 2014 at 20:09

1 Answer

If we can assume

  • each line begins with two ints separated by commas,
  • each line ends with a date, preceded by a comma, and
  • everything remaining (in the middle) belongs in the third column,

then your data could be parsed this way:

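data = []
with open('data') as f:
    for line in f:
        # Split off the first two fields; the rest of the line stays in parts[2].
        parts = line.split(',', 2)
        # Split the remainder at its last comma: content on the left, date on the right.
        parts[2:4] = parts[2].rsplit(',', 1)
        # Convert Id and Value to ints.
        parts[:2] = map(int, parts[:2])
        # Trim surrounding whitespace (and the newline) from Content and Date.
        parts[2:] = map(str.strip, parts[2:])
        data.append(parts)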

for row in data:
    print(row)

yields

[1, 34, '"market, business"', '12/20/2013']
[2, 15, '"market, business", "yesterday, metric', '11/21/2014']
[3, 18, '"market," business and yesterday', '10/20/2014']
[4, 19, 'yesterday, today', '11/22/2014']
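
Under the same three assumptions, the split could also be written as a single regular expression. This is only a sketch: LINE_RE and parse_line are hypothetical names, and the pattern relies on the date itself containing no comma.

```python
import re

# Assumed layout: int, int, free text (may contain commas), then a final
# comma-free field holding the date.
LINE_RE = re.compile(r'^\s*(\d+)\s*,\s*(\d+)\s*,(.*),([^,]*)$')


def parse_line(line):
    """Split one line into [id, value, content, date] (hypothetical helper)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        # Lines that break the assumptions are reported, not guessed at.
        raise ValueError('unexpected line format: %r' % line)
    id_, value, content, date = match.groups()
    return [int(id_), int(value), content.strip(), date.strip()]


print(parse_line('1, 34, "market, business", 12/20/2013'))
print(parse_line('4, 19, yesterday, today, 11/22/2014'))
```

Because the greedy `(.*)` group backtracks only as far as the last comma, every earlier comma stays inside the content field. A line that does not fit the layout raises a ValueError.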

You could then make a DataFrame like this:

import pandas as pd
df = pd.DataFrame(data, columns=['Id','Value','Content','Date'])
print(df)

yields

   Id  Value                                 Content        Date
0   1     34                      "market, business"  12/20/2013
1   2     15  "market, business", "yesterday, metric  11/21/2014
2   3     18        "market," business and yesterday  10/20/2014
3   4     19                        yesterday, today  11/22/2014