1

This is a small example of a big csv file. Imagine this csv file contais a text like this one below:

import io
import pandas as pd

text="""A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""

I would like to read it as a dataframe and the first line should be the column names. My problem occurs with the lines 2 and 3 because of the quotation marks separating float numbers. If I use the code below, the last line is perfect because there aren't any quotation marks separating the numbers.

df = pd.read_csv(io.StringIO(text), sep=',', engine="python", encoding='utf-8', decimal=",")

df
Out[73]: 
                                                   A        B  ...   O   P
0  2019,01/2019,Brazil,0,"166,229",0,"22578,86292...     None  ... NaN NaN
1  2019,01/2019,Brazil,95,0,"50,34563",0,"5,137",...     None  ... NaN NaN
2                                               2019  01/2019  ... NaN NaN

[3 rows x 16 columns]

Could you show me how can I read this properly, separating each value by column? Thanks

1 Answer 1

1

First I would check how the CSV file is generated and fix the root cause (why are there the quotation marks?) If this isn't possible, try to remove all quotation marks:

text = """A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
"2019,01/2019,Brazil,0,""166,229"",0,""22578,86292"",""47,417"",,,,,,,,"
"2019,01/2019,Brazil,95,0,""50,34563"",0,""5,137"",,,,,,,,"
2019,01/2019,Brazil,0,0,0,0,0,,,,,,,,"""

import re

text = re.sub(r'""(\d+,?\d*)""', lambda g: g.group(1).replace(",", "."), text)

df = pd.read_csv(
    io.StringIO(text.replace('"', "")),
    sep=",",
    engine="python",
    encoding="utf-8",
)

print(df)

This prints:

      A        B       C   D        E         F            G       H   I   J   K   L   M   N   O   P
0  2019  01/2019  Brazil   0  166.229   0.00000  22578.86292  47.417 NaN NaN NaN NaN NaN NaN NaN NaN
1  2019  01/2019  Brazil  95    0.000  50.34563      0.00000   5.137 NaN NaN NaN NaN NaN NaN NaN NaN
2  2019  01/2019  Brazil   0    0.000   0.00000      0.00000   0.000 NaN NaN NaN NaN NaN NaN NaN NaN
Sign up to request clarification or add additional context in comments.

1 Comment

Actually, the numbers separated by quotation marks are floats. So they should stay in one column and not separated. For example, 166.229 should be only in column B

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.