
I have a CSV file that contains 130,000 rows. After reading the file in with pandas' read_csv function, one of the columns ("CallGuid") has mixed object types.

I did:

df = pd.read_csv("data.csv")

Then I have this:

In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L    

In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'

All rows <= 32767 are of type long, and all rows > 32767 are unicode.

Why is this?

  • Have you checked that row in your original CSV? It could be quoted or have some other issue. If you do df = pd.read_csv("data.csv", skiprows=32768), is the dtype still wrong? Commented Aug 27, 2014 at 15:15
  • @EdChum After I put in skiprows=32768 I lost the column names, which were in row 0. How do I keep the header row? Commented Aug 27, 2014 at 15:24
  • 1
    do skiprows=[32768]. You skipped the first 32768 rows without the [] Commented Aug 27, 2014 at 15:34
  • After skiprows=[32768], I still have df["CallGuid"][32767] as long and df["CallGuid"][32768] as unicode. Commented Aug 27, 2014 at 15:39
  • 4
    The point being is whether the original data is ill formed, you need to check whether the original csv has ill formed data, otherwise you could fix this after loading by doing df[CallGuid'] = df['CallGui'].astype(int64) Commented Aug 27, 2014 at 15:52
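
For reference, a minimal sketch of that post-load fix, assuming every value in the column can actually be parsed as an integer (otherwise the cast will raise):

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")

# cast the mixed object column to a single integer dtype
df["CallGuid"] = df["CallGuid"].astype(np.int64)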

2 Answers


As others have pointed out, your data could be malformed, e.g. with stray quotes around some values.

Just try doing:

import pandas as pd
import numpy as np

# force the column to a single integer dtype instead of letting pandas infer it
df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})

It's also more memory efficient, since pandas doesn't have to guess the data types.
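
If the column really does contain malformed values, forcing the dtype at read time will raise an error instead of loading. A possible alternative (a sketch, assuming the column is meant to be numeric) is to let pandas coerce the bad entries to NaN so you can inspect them:

import pandas as pd

df = pd.read_csv("data.csv")

# anything that is not a valid number becomes NaN, making the bad rows easy to find
converted = pd.to_numeric(df["CallGuid"], errors="coerce")
print(df.loc[converted.isna(), "CallGuid"])

# once the offending rows are fixed or dropped, the column casts cleanly
clean = df[converted.notna()].copy()
clean["CallGuid"] = converted[converted.notna()].astype("int64")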



OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767.

I did indeed have a problem in my data, but not at line 32767 at all.

Finding and fixing these few problematic lines solved my problem. I managed to locate them with the following extremely dirty routine:

import pandas as pd

# read the file in chunks and print each chunk's inferred dtype for the column
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=10000)):
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))

I ran this and obtained:

0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64

This told me that there was (at least) one problematic line between rows 60000 and 69999 and one between rows 80000 and 89999.

To locate them more precisely, you can just use a smaller chunksize and print only the chunks that do not have the correct data type, as in the sketch below.
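
A minimal sketch of that refinement, assuming the same column name and that well-formed rows should come out as int64:

import pandas as pd

# re-read with a smaller chunk size and report only the chunks whose dtype is wrong
chunksize = 1000
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=chunksize)):
    if chunk["Custom Dimension 02"].dtype != "int64":
        print("bad chunk {}: rows {} to {}".format(i, i * chunksize, (i + 1) * chunksize - 1))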

1 Comment

I found this super useful! I now have this as part of my script, with a nested loop that goes chunk by chunk in chunks of 100 rows to pin down the bad rows and redefine the correct dtype.
