How to use pandas read_csv to read numbers, dates and strings correctly from a csv file?

Question

I have a csv file (data.csv) like this:

A,B,C,D,E
1.50,2.70,"2,481","1,569",2.15
2020-1-1,2020-1-2,2020-1-3,2020-1-4,2020-1-5
John, Jeff, Ruben, Cath, James

I tried df = pd.read_csv("data.csv", thousands=',') and got the following df:

          A        B        C        D        E
0       1.5      2.7    2,481    1,569     2.15
1  2020-1-1 2020-1-2 2020-1-3 2020-1-4 2020-1-5
2      John     Jeff    Ruben     Cath    James

It looks OK, but all the numbers and dates in df are actually strings, whereas Excel reads and converts them correctly.
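
One way to confirm this is to inspect the column dtypes. The sketch below inlines the sample file through io.StringIO (an assumption made only to keep the snippet self-contained; csv_text and numeric_cols are illustrative names) and checks that no column was inferred as numeric:

```python
import io
import pandas as pd

# The sample file from the question, inlined so the snippet is self-contained.
csv_text = '''A,B,C,D,E
1.50,2.70,"2,481","1,569",2.15
2020-1-1,2020-1-2,2020-1-3,2020-1-4,2020-1-5
John, Jeff, Ruben, Cath, James
'''

df = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Every column mixes a number, a date and a name, so pandas cannot pick
# a numeric or datetime dtype for any of them.
print(df.dtypes)

# Collect the columns that were inferred as numeric (expected: none).
numeric_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
print(numeric_cols)
```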

How can we read numbers, dates and strings from a csv file correctly?

A column usually holds only one type. Check whether your column A contains both dates and numbers, i.e. more than one type.
– BENY, Commented Feb 20, 2021 at 17:02
Yes, but the actual CSV file is laid out this way, and there are many such CSV files.
– John, Commented Feb 22, 2021 at 1:49

Stryder · Accepted Answer · 2021-02-20 18:22:00Z

The preferred way to handle this is to read the file in normally, take the transpose, and then process it column-wise, like this:

from pandas import read_csv

pth = "data.csv"  # path to the CSV file from the question
DF = read_csv(pth).T
DF
       0         1       2
A   1.50  2020-1-1    John
B   2.70  2020-1-2    Jeff
C  2,481  2020-1-3   Ruben
D  1,569  2020-1-4    Cath
E   2.15  2020-1-5   James

DF[0] = DF[0].str.replace(",","").astype(float)
DF
         0         1       2
A     1.50  2020-1-1    John
B     2.70  2020-1-2    Jeff
C  2481.00  2020-1-3   Ruben
D  1569.00  2020-1-4    Cath
E     2.15  2020-1-5   James

Then you also have series (columns) with the correct type:

DF[0]
A       1.50
B       2.70
C    2481.00
D    1569.00
E       2.15
Name: 0, dtype: float64  #<<<<< float
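
The same column-wise idea can be extended to the date and name columns. This is only a sketch under some assumptions: the file is inlined via io.StringIO, the column labels number, date and name are hypothetical, and pd.to_datetime and str.strip are my suggestions for the other two columns rather than something shown above:

```python
import io
import pandas as pd

# The sample file from the question, inlined so the snippet is self-contained.
csv_text = '''A,B,C,D,E
1.50,2.70,"2,481","1,569",2.15
2020-1-1,2020-1-2,2020-1-3,2020-1-4,2020-1-5
John, Jeff, Ruben, Cath, James
'''

# Transpose so that each original row becomes a column holding one type.
tdf = pd.read_csv(io.StringIO(csv_text)).T
tdf.columns = ["number", "date", "name"]  # hypothetical labels

# Numbers: drop the thousands separator, then cast to float.
tdf["number"] = tdf["number"].str.replace(",", "").astype(float)
# Dates: let pandas parse the year-month-day strings.
tdf["date"] = pd.to_datetime(tdf["date"])
# Names: remove the leading space left after each comma in the file.
tdf["name"] = tdf["name"].str.strip()

print(tdf.dtypes)
```

After this, tdf["number"] is float64 and tdf["date"] is a datetime column, so arithmetic and date operations work directly on them.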

If you are really hell-bent on keeping the original shape, you could also do it like this:

df = read_csv(pth)
df.iloc[0,:] = df.iloc[0,:].str.replace(",", "").astype(float)
df
          A         B         C         D         E
0       1.5       2.7    2481.0    1569.0      2.15
1  2020-1-1  2020-1-2  2020-1-3  2020-1-4  2020-1-5
2      John      Jeff     Ruben      Cath     James

Then you could do this:

df.iloc[0,0] + df.iloc[0,2]
2482.5

But the row itself would still have dtype object rather than float, which may be a disadvantage at some point:

df.iloc[0,:]
A       1.50
B       2.70
C    2481.00
D    1569.00
E       2.15
Name: 0, dtype: object  #<<<<< object
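
If the original shape has to stay, a possible middle ground is to leave df untouched and pull each row out as its own Series, since a standalone Series can carry a proper dtype. A minimal sketch, assuming the sample file is inlined via io.StringIO; the names numbers, dates and names are illustrative, and pd.to_numeric / pd.to_datetime are my choice here rather than part of the approach above:

```python
import io
import pandas as pd

# The sample file from the question, inlined so the snippet is self-contained.
csv_text = '''A,B,C,D,E
1.50,2.70,"2,481","1,569",2.15
2020-1-1,2020-1-2,2020-1-3,2020-1-4,2020-1-5
John, Jeff, Ruben, Cath, James
'''

df = pd.read_csv(io.StringIO(csv_text))

# Each row is extracted and converted separately; df itself is not modified.
numbers = pd.to_numeric(df.iloc[0].str.replace(",", ""))
dates = pd.to_datetime(df.iloc[1])
names = df.iloc[2].str.strip()

print(numbers.dtype)                 # float64
print(numbers["A"] + numbers["C"])   # 2482.5
```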

edited Feb 20, 2021 at 18:22

answered Feb 20, 2021 at 18:14

Stryder

Thanks! I think this is the only workaround until the next pandas version fixes this defect and works as well as Excel.
– John, Commented over a year ago
