wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]}

I want to delete the row containing 'hhh', because the values in 'a' are supposed to all be numbers. The original dataset is huge. Thank you very much.


3 Answers


Option 1
Convert a using pd.to_numeric

df.a = pd.to_numeric(df.a, errors='coerce')
df

     a    b
0  NaN  1.0
1  2.0  2.0
2  3.0  NaN
3  4.0  NaN
4  5.0  5.0

Non-numeric values are coerced to NaN. You can then drop this row -

df.dropna(subset=['a'])

     a    b
1  2.0  2.0
2  3.0  NaN
3  4.0  NaN
4  5.0  5.0
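Both steps can also be chained into a single expression; a minimal sketch, assuming the wu frame from the question (with the missing parenthesis fixed and imports added; assign is plain pandas, not part of the answer above):

import pandas as pd
import numpy as np

wu = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})

# coerce 'a' to numeric, then drop the rows that failed to parse
wu = wu.assign(a=pd.to_numeric(wu['a'], errors='coerce')).dropna(subset=['a'])
print(wu)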

Option 2
Another option is to use str.isdigit -

df.a.str.isdigit()

0    False
1      NaN
2      NaN
3      NaN
4      NaN
Name: a, dtype: object

Filter as such -

df[df.a.str.isdigit().isnull()]

   a    b
1  2  2.0
2  3  NaN
3  4  NaN
4  5  5.0

Notes -

  • This won't work for float columns
  • If the numbers are also stored as strings, then drop the isnull bit (see the sketch below) -

    df[df.a.str.isdigit()]
    
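For example, on a frame where every value in a is a string (a hypothetical frame, not the question's data), the boolean result of str.isdigit can be used directly:

import pandas as pd

# hypothetical column where the numbers are stored as strings
df = pd.DataFrame({'a': ['hhh', '2', '3', '4', '5']})

# keep only the rows where 'a' consists of digits
print(df[df.a.str.isdigit()])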

4 Comments

Nice! I think there could be some sort of warning for creating float values. Something along the lines of: pd.to_numeric returns floats if there are NaN values.
@AntonvBR Interestingly, it coerces the column to floats only if NaNs are generated as a result of coercing the non-numeric values. Otherwise, it attempts to preserve the current dtype :)
Exactly, because integers can't represent NaN values.
A thought just crossed my mind. In this specific case we could even use this mask: pd.to_numeric(df.a, errors='coerce').notnull() which would work for floats too. Right?
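A minimal sketch of that mask idea, assuming the df from Option 1: the mask only filters rows, so the values in a are left untouched, and it also handles floats and numeric strings:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})

# True wherever 'a' can be parsed as a number
mask = pd.to_numeric(df.a, errors='coerce').notnull()
print(df[mask])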
import pandas as pd
import numpy as np

wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})

#wu = wu[wu.a.str.contains(r'\d+', na=False)]   # doesn't work here: .str gives NaN for the non-string rows, and na=False then drops them too

#wu = wu[wu.a.apply(lambda x: x.isnumeric())]   # fails with AttributeError, because the int values have no .isnumeric()

wu = wu[wu.a.apply(lambda x: isinstance(x, (int, np.int64)))]   # keep only the rows where 'a' is an integer

print(wu)

Note that you missed a closing parenthesis when creating your DataFrame.

I tried three approaches, but only the third one worked. You can still try the other two (commented out) to see whether they work for you. Do let me know if it works on the larger dataset.

5 Comments

Please don't use apply when you don't need to.
I do not see anything wrong with using it. Can you explain why apply is a bad choice?
Of course. apply is a convenience function that hides a loop. If there are N ways to solve the problem, apply is quite consistently the slowest of them. It provides no vectorisation, and does not assume anything about your function. Furthermore, it has a lot of overhead (just look at the source code), so quite often a simple python loop doing the same thing is faster.
Had the same discussion with another user here: stackoverflow.com/questions/48027171/…
@cᴏʟᴅsᴘᴇᴇᴅ How can this be solved with vectorisation?
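One vectorised alternative to the apply/isinstance filter is the pd.to_numeric mask from Option 1 above. A rough benchmarking sketch on a larger frame (the frame here is made up for illustration; actual timings depend on your machine):

import timeit
import numpy as np
import pandas as pd

# a bigger object column: one stray string among many integers
big = pd.DataFrame({'a': ['hhh'] + list(range(100_000))})

def apply_way():
    return big[big.a.apply(lambda x: isinstance(x, (int, np.int64)))]

def mask_way():
    return big[pd.to_numeric(big.a, errors='coerce').notnull()]

print('apply:', timeit.timeit(apply_way, number=10))
print('mask: ', timeit.timeit(mask_way, number=10))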
df = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})

df.drop(df[df['a'].apply(type) != int].index, inplace=True)

if you just want to view the rows that will be dropped:

df.loc[df['a'].apply(type) != int, :]
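Flipping the comparison keeps the numeric rows instead of viewing the ones to drop; a small sketch, assuming the frame as originally constructed (before the in-place drop):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})

# keep the rows where 'a' actually holds a Python int
print(df[df['a'].apply(type) == int])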

Comments
