
I have a CSV file that contains 130,000 rows. After reading the file in with pandas' read_csv function, one of the columns ("CallGuid") has mixed object types.

I did:

df = pd.read_csv("data.csv")

Then I have this:

In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L    

In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'

All rows <= 32767 are of type long, and all rows > 32767 are unicode.

Why is this?

  • Have you checked that row in your original CSV? It could be quoted or have some other issue. If you do df = pd.read_csv("data.csv", skiprows=32768), is the dtype still wrong? Commented Aug 27, 2014 at 15:15
  • @EdChum After I put in skiprows=32768 I lost the column names, which were in row 0. How do I keep the header row? Commented Aug 27, 2014 at 15:24
  • 1
    do skiprows=[32768]. You skipped the first 32768 rows without the [] Commented Aug 27, 2014 at 15:34
  • After skiprows=[32768], I still have df["CallGuid"][32767] as long and df["CallGuid"][32768] as unicode. Commented Aug 27, 2014 at 15:39
  • 4
    The point being is whether the original data is ill formed, you need to check whether the original csv has ill formed data, otherwise you could fix this after loading by doing df[CallGuid'] = df['CallGui'].astype(int64) Commented Aug 27, 2014 at 15:52
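
For reference, a minimal sketch of that post-load fix, assuming every value in the column can actually be parsed as an integer (otherwise the cast will raise):

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")

# cast the mixed object column to a single integer dtype
df["CallGuid"] = df["CallGuid"].astype(np.int64)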

2 Answers


As others have pointed out, your data could be malformed, e.g. with stray quotes around some values.

Just try doing:

import pandas as pd
import numpy as np

# force the column to a single integer dtype instead of letting pandas infer it
df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})

It's also more memory efficient, since pandas doesn't have to guess the data types.
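
If the column really does contain malformed values, forcing the dtype at read time will raise an error instead of loading. A possible alternative (a sketch, assuming the column is meant to be numeric) is to let pandas coerce the bad entries to NaN so you can inspect them:

import pandas as pd

df = pd.read_csv("data.csv")

# anything that is not a valid number becomes NaN, making the bad rows easy to find
converted = pd.to_numeric(df["CallGuid"], errors="coerce")
print(df.loc[converted.isna(), "CallGuid"])

# once the offending rows are fixed or dropped, the column casts cleanly
clean = df[converted.notna()].copy()
clean["CallGuid"] = converted[converted.notna()].astype("int64")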



OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767.

I did indeed have a problem in my data, but not at line 32767 at all.

Finding and fixing these few problematic lines solved my problem. I managed to locate them with the following extremely dirty routine:

import pandas as pd

# read the file in chunks and print each chunk's inferred dtype for the column
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=10000)):
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))

I ran this and obtained:

0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64

This told me that there was (at least) one problematic line between rows 60000 and 69999 and one between rows 80000 and 89999.

To locate them more precisely, you can just use a smaller chunksize and print only the chunks that do not have the correct data type, as in the sketch below.
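
A minimal sketch of that refinement, assuming the same column name and that well-formed rows should come out as int64:

import pandas as pd

# re-read with a smaller chunk size and report only the chunks whose dtype is wrong
chunksize = 1000
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=chunksize)):
    if chunk["Custom Dimension 02"].dtype != "int64":
        print("bad chunk {}: rows {} to {}".format(i, i * chunksize, (i + 1) * chunksize - 1))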

1 Comment

I found this super useful! I now have this as part of my script, with a nested loop that goes chunk by chunk in chunks of 100 rows to pin down the bad rows and redefine the correct dtype.
