How to read string as dataframe separating by comma but with irregular quotation marks?

Question

This is a small example of a large CSV file. Suppose the file contains text like this:

import io
import pandas as pd

text="""A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""

I would like to read it as a dataframe, with the first line as the column names. The problem is with lines 2 and 3: each one is wrapped in quotation marks as a whole, and the float numbers inside are wrapped in doubled quotation marks. With the code below, only the last line is parsed correctly, because it contains no quotation marks.

df = pd.read_csv(io.StringIO(text), sep=',', engine="python", encoding='utf-8', decimal=",")

df
Out[73]: 
                                                   A        B  ...   O   P
0  2019,01/2019,Brazil,0,"166,229",0,"22578,86292...     None  ... NaN NaN
1  2019,01/2019,Brazil,95,0,"50,34563",0,"5,137",...     None  ... NaN NaN
2                                               2019  01/2019  ... NaN NaN

[3 rows x 16 columns]

Could you show me how to read this properly, with each value in its own column? Thanks.

Andrej Kesely · Accepted Answer · 2022-10-19 15:52:40Z

1

First, I would check how the CSV file is generated and fix the root cause (why are the quotation marks there?). If that isn't possible, convert the quoted decimal numbers to use a dot and then remove all remaining quotation marks:

text = """A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""

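import io
import re

import pandas as pd

# Turn each ""digits,digits"" into digits.digits: drop the doubled quotes and
# use a dot as the decimal separator so the comma no longer splits the field.
text = re.sub(r'""(\d+,?\d*)""', lambda g: g.group(1).replace(",", "."), text)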

df = pd.read_csv(
    io.StringIO(text.replace('"', "")),
    sep=",",
    engine="python",
    encoding="utf-8",
)

print(df)

This prints:

      A        B       C   D        E         F            G       H   I   J   K   L   M   N   O   P
0  2019  01/2019  Brazil   0  166.229   0.00000  22578.86292  47.417 NaN NaN NaN NaN NaN NaN NaN NaN
1  2019  01/2019  Brazil  95    0.000  50.34563      0.00000   5.137 NaN NaN NaN NaN NaN NaN NaN NaN
2  2019  01/2019  Brazil   0    0.000   0.00000      0.00000   0.000 NaN NaN NaN NaN NaN NaN NaN NaN
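
An alternative sketch, not part of the answer above: the quoted lines look like complete CSV rows that were written out a second time as a single quoted field (this is an assumption about how the file was produced). If that holds, Python's csv module can undo the wrapping by parsing such a row twice, and the decimal commas can be converted afterwards. The helper name unwrap_rows and the choice of B and C as the text columns are assumptions made for this example.

```python
import csv
import io

import pandas as pd

text = """A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""


def unwrap_rows(raw):
    """Yield each record as a list of fields (hypothetical helper).

    A record that comes back as a single field containing commas is assumed
    to be a whole row that was quoted twice, so that field is parsed again.
    """
    for row in csv.reader(io.StringIO(raw)):
        if len(row) == 1 and "," in row[0]:
            row = next(csv.reader([row[0]]))
        yield row


header, *records = unwrap_rows(text)
df = pd.DataFrame(records, columns=header)

# Assumption: every column except B and C holds numbers with a decimal comma.
numeric_cols = [col for col in header if col not in ("B", "C")]
for col in numeric_cols:
    as_dot = df[col].str.replace(",", ".", regex=False)
    # Empty strings cannot be parsed and become NaN with errors="coerce".
    df[col] = pd.to_numeric(as_dot, errors="coerce")

print(df)
```

Unlike the regular expression, this does not depend on what the quoted values look like, but it does rely on the doubly quoted rows being valid CSV in their own right.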

edited Oct 19, 2022 at 15:52

answered Oct 19, 2022 at 15:44

Andrej Kesely


1 Comment

user026 Over a year ago

Actually, the numbers wrapped in quotation marks are floats, so each one should stay in a single column and not be separated. For example, 166.229 should be only in column B
