69

I have a large dataframe with ID numbers:

ID.head()
Out[64]: 
0    4806105017087
1    4806105017087
2    4806105017087
3    4901295030089
4    4901295030089

These are all strings at the moment.

I want to convert them to int without using loops; for this I use ID.astype(int).

The problem is that some of my rows contain dirty data that cannot be converted to int, e.g.

ID[154382]
Out[58]: 'CN414149'

How can I (without using loops) remove these kinds of occurrences so that I can use astype with peace of mind?

3 Answers

123

You need to pass the parameter errors='coerce' to the function to_numeric:

ID = pd.to_numeric(ID, errors='coerce')

If ID is a column:

df.ID = pd.to_numeric(df.ID, errors='coerce')

Note that non-numeric values are converted to NaN, so all values end up as float.

To get int, you need to replace NaN with some value, e.g. 0, and then cast to int:

df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['4806105017087','4806105017087','CN414149']})
print (df)
              ID
0  4806105017087
1  4806105017087
2       CN414149

print (pd.to_numeric(df.ID, errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: ID, dtype: float64

df.ID = pd.to_numeric(df.ID, errors='coerce').fillna(0).astype(np.int64)
print (df)
              ID
0  4806105017087
1  4806105017087
2              0
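
If you would rather drop the dirty rows than replace them with 0 (which is closer to what the question asks for), a minimal sketch could look like the following. The names numeric_id and clean are only illustrative, and the sample data is the same as above:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['4806105017087', '4806105017087', 'CN414149']})

# Coerce anything that is not a number to NaN.
numeric_id = pd.to_numeric(df['ID'], errors='coerce')

# Keep only the rows that parsed successfully, then cast them to int64.
valid = numeric_id.notna()
clean = df.loc[valid].copy()
clean['ID'] = numeric_id[valid].astype('int64')

print(clean)
```

The boolean mask valid can also be inverted (~valid) to inspect the rejected rows before discarding them.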

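EDIT: With pandas 0.25+ it is possible to use the nullable integer dtype (integer_na), which keeps missing values without falling back to float (newer pandas versions display the missing value as <NA> instead of NaN):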

df.ID = pd.to_numeric(df.ID, errors='coerce').astype('Int64')
print (df)
              ID
0  4806105017087
1  4806105017087
2            NaN

10
If you're here because you got

OverflowError: Python int too large to convert to C long

this typically means astype(int) mapped to a 32-bit integer on your platform (e.g. on Windows). Use .astype('int64') to get 64-bit signed integers:

df['ID'] = df['ID'].astype('int64')

If you don't want to lose the values with letters in them, use str.replace() with a regex pattern to remove the non-digit characters.

df['ID'] = df['ID'].str.replace('[^0-9]', '', regex=True).astype('int64')

Then input

0    4806105017087
1    4806105017087
2         CN414149
Name: ID, dtype: object

converts into

0    4806105017087
1    4806105017087
2           414149
Name: ID, dtype: int64
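
One caveat with this approach: a value that contains no digits at all becomes an empty string, and astype('int64') then raises a ValueError. A hedged sketch of one way around that, assuming it is acceptable to treat such values as missing, is to combine the regex with to_numeric and the nullable Int64 dtype (the 'N/A' entry is made-up sample data):

```python
import pandas as pd

ids = pd.Series(['4806105017087', 'CN414149', 'N/A'], name='ID')

# Strip every non-digit character; 'N/A' collapses to an empty string.
digits = ids.str.replace('[^0-9]', '', regex=True)

# Empty strings become NaN, and Int64 (capital I) can hold missing values.
result = pd.to_numeric(digits, errors='coerce').astype('Int64')

print(result)
```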

0

I solved this in January 2024 in the latest version of Jupyter Notebook as follows.

Wrap the conversion in try/except so that, if it does not work, you can see what the error is. I checked the dtype of the "Price" column: previously it was object ("O") and now it shows int64, which is what we are looking for.

try:
    # Strip "$" and "," and drop the decimal part (e.g. ".00") before casting.
    car_sales["Price"] = car_sales["Price"].str.replace(r'[\$,]|\.\d*', '', regex=True).astype(int)
except ValueError as e:
    print(f"Error: {e}")
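
To see what the regex does, here is a small self-contained sketch. The price strings are hypothetical sample data, since the original car_sales frame is not shown, and it casts to 'int64' explicitly so the result does not depend on the platform's default integer size:

```python
import pandas as pd

# Hypothetical sample data standing in for the real car_sales frame.
car_sales = pd.DataFrame({"Price": ["$4,000.00", "$5,000.00", "$7,000.00"]})

# Remove "$" and ",", and drop a trailing decimal part such as ".00".
cleaned = car_sales["Price"].str.replace(r"[\$,]|\.\d*", "", regex=True)
car_sales["Price"] = cleaned.astype("int64")

print(car_sales["Price"].dtype)
```

Note that the pattern truncates the decimal part rather than rounding it, so it only suits prices where the cents can be discarded.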
