wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]}

I want to delete the row containing 'hhh', because the values in 'a' are supposed to all be numbers. The original dataset is huge. Thank you very much.


3 Answers


Option 1
Convert a using pd.to_numeric

df.a = pd.to_numeric(df.a, errors='coerce')
df

     a    b
0  NaN  1.0
1  2.0  2.0
2  3.0  NaN
3  4.0  NaN
4  5.0  5.0

Non-numeric values are coerced to NaN. You can then drop this row -

df.dropna(subset=['a'])

     a    b
1  2.0  2.0
2  3.0  NaN
3  4.0  NaN
4  5.0  5.0
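Both steps can also be chained into a single expression; a minimal sketch, assuming the wu frame from the question (with the missing parenthesis fixed and imports added; assign is plain pandas, not part of the answer above):

import pandas as pd
import numpy as np

wu = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})

# coerce 'a' to numeric, then drop the rows that failed to parse
wu = wu.assign(a=pd.to_numeric(wu['a'], errors='coerce')).dropna(subset=['a'])
print(wu)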

Option 2
Another option is to use str.isdigit -

df.a.str.isdigit()

0    False
1      NaN
2      NaN
3      NaN
4      NaN
Name: a, dtype: object

Filter as such -

df[df.a.str.isdigit().isnull()]

   a    b
1  2  2.0
2  3  NaN
3  4  NaN
4  5  5.0

Notes -

  • This won't work for float columns
  • If the numbers are also stored as strings, then drop the isnull bit (see the sketch below) -

    df[df.a.str.isdigit()]
    
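For example, on a frame where every value in a is a string (a hypothetical frame, not the question's data), the boolean result of str.isdigit can be used directly:

import pandas as pd

# hypothetical column where the numbers are stored as strings
df = pd.DataFrame({'a': ['hhh', '2', '3', '4', '5']})

# keep only the rows where 'a' consists of digits
print(df[df.a.str.isdigit()])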

4 Comments

Nice! I think there could be some sort of warning for creating float values. Something along the lines of: pd.to_numeric returns floats if there are NaN values.
@AntonvBR Interestingly, it coerces the column to floats only if NaNs are generated as a result of coercing the non-numeric values. Otherwise, it attempts to preserve the current dtype :)
Exactly, because integers can't represent NaN values.
A thought just crossed my mind. In this specific case we could even use this mask: pd.to_numeric(df.a, errors='coerce').notnull() which would work for floats too. Right?
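A minimal sketch of that mask idea, assuming the df from Option 1: the mask only filters rows, so the values in a are left untouched, and it also handles floats and numeric strings:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})

# True wherever 'a' can be parsed as a number
mask = pd.to_numeric(df.a, errors='coerce').notnull()
print(df[mask])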
import pandas as pd
import numpy as np

wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})

#wu = wu[wu.a.str.contains(r'\d+', na=False)]   # doesn't work here: .str gives NaN for the non-string rows, and na=False then drops them too

#wu = wu[wu.a.apply(lambda x: x.isnumeric())]   # fails with AttributeError, because the int values have no .isnumeric()

wu = wu[wu.a.apply(lambda x: isinstance(x, (int, np.int64)))]   # keep only the rows where 'a' is an integer

print(wu)

Note that you missed a closing parenthesis when creating your DataFrame.

I tried three approaches, but only the third one worked. You can still try the other two (commented out) to see whether they work for you. Do let me know if it works on the larger dataset.

5 Comments

Please don't use apply when you don't need to.
I do not see anything wrong with using it. Can you explain why apply is a bad choice?
Of course. apply is a convenience function that hides a loop. If there are N ways to solve the problem, apply is quite consistently the slowest of them. It provides no vectorisation, and does not assume anything about your function. Furthermore, it has a lot of overhead (just look at the source code), so quite often a simple python loop doing the same thing is faster.
Had the same discussion with another user here: stackoverflow.com/questions/48027171/…
@cᴏʟᴅsᴘᴇᴇᴅ How can this be solved with vectorisation?
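One vectorised alternative to the apply/isinstance filter is the pd.to_numeric mask from Option 1 above. A rough benchmarking sketch on a larger frame (the frame here is made up for illustration; actual timings depend on your machine):

import timeit
import numpy as np
import pandas as pd

# a bigger object column: one stray string among many integers
big = pd.DataFrame({'a': ['hhh'] + list(range(100_000))})

def apply_way():
    return big[big.a.apply(lambda x: isinstance(x, (int, np.int64)))]

def mask_way():
    return big[pd.to_numeric(big.a, errors='coerce').notnull()]

print('apply:', timeit.timeit(apply_way, number=10))
print('mask: ', timeit.timeit(mask_way, number=10))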
df = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})

df.drop(df[df['a'].apply(type) != int].index, inplace=True)

if you just want to view the rows that will be dropped:

df.loc[df['a'].apply(type) != int, :]
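Flipping the comparison keeps the numeric rows instead of viewing the ones to drop; a small sketch, assuming the frame as originally constructed (before the in-place drop):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['hhh', 2, 3, 4, 5], 'b': [1, 2, np.nan, np.nan, 5]})

# keep the rows where 'a' actually holds a Python int
print(df[df['a'].apply(type) == int])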

Comments
