Using Pandas to filter non-numeric data from two columns of a DataFrame

Question

I'm loading a Pandas dataframe which has many data types (loaded from Excel). Two particular columns should be floats, but occasionally a researcher entered a free-text comment like "not measured." I need to drop every row where the value in either of those two columns is not a number, while preserving the non-numeric data in the other columns. A simple use case looks like this (the real table has several thousand rows...)

import pandas as pd

df = pd.DataFrame(dict(A = pd.Series([1,2,3,4,5]), B = pd.Series([96,33,45,'',8]), C = pd.Series([12,'Not measured',15,66,42]), D = pd.Series(['apples', 'oranges', 'peaches', 'plums', 'pears'])))

Which results in this data table:

    A   B   C               D
0   1   96  12              apples
1   2   33  Not measured    oranges
2   3   45  15              peaches
3   4       66              plums
4   5   8   42              pears

I'm not clear how to get to this table:

    A   B   C               D
0   1   96  12              apples
2   3   45  15              peaches
4   5   8   42              pears

I tried dropna, but the types are "object" since there are non-numeric entries. I can't convert the values to floats without either converting the whole table, or doing one series at a time which loses the relationship to the other data in the row. Perhaps there is something simple I'm not understanding?

jezrael · Accepted Answer · 2016-04-06 05:59:29Z

You can first select the subset of columns B and C, apply to_numeric with errors='coerce' (which turns every unparseable value into NaN), and check that all values in each row are notnull. Then use boolean indexing:

print(df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1))
0     True
1    False
2     True
3    False
4     True
dtype: bool

print(df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
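The same steps can be spelled out one at a time. This is only a sketch of the approach above: the names `numeric_cols`, `coerced`, `mask` and `clean` are made up for illustration, and the final conversion to float is an optional extra based on the question's remark that the two columns should be floats.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})

numeric_cols = ["B", "C"]  # hypothetical name: the columns that should hold numbers

# Coerce only the checked columns; values that cannot be parsed become NaN.
coerced = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Keep a row only if every checked column parsed successfully.
mask = coerced.notnull().all(axis=1)
clean = df[mask].copy()

# Optional: store the parsed values so B and C end up with a float dtype.
for col in numeric_cols:
    clean[col] = coerced.loc[mask, col].astype(float)

print(clean)
print(clean.dtypes)
```

Because the mask is computed from a coerced copy, the other columns (A and D here) are never converted and keep their original values.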

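Another solution uses str.isdigit with isnull. On a column that mixes real numbers with strings, str.isdigit returns NaN for the non-string (numeric) cells, so isnull marks the numeric values. Combine the two column masks with & so that a row is kept only when both B and C are numeric (combining them with ^ would keep a row in which both columns are invalid):

print(df['B'].str.isdigit().isnull() & df['C'].str.isdigit().isnull())
0     True
1    False
2     True
3    False
4     True
dtype: bool

print(df[df['B'].str.isdigit().isnull() & df['C'].str.isdigit().isnull()])
   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears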
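The str.isdigit approach depends on the numeric cells being real numbers rather than text. A small sketch of what str.isdigit returns on a mixed column; the sample values, including the text cell "96", are made up for illustration and are not from the question's data:

```python
import pandas as pd

# Mixed object column: a real integer, an empty string,
# a number stored as text, and a comment.
s = pd.Series([96, "", "96", "n/a"], dtype=object)

flags = s.str.isdigit()      # missing for non-strings, True/False for strings
is_missing = flags.isnull()  # True only where the cell was not a string

print(flags)
print(is_missing)
```

Under the isnull test, only the real integer counts as numeric. A number that Excel stored as text would be filtered out, whereas to_numeric would parse it.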

But the fastest solution applies to_numeric to each column separately and combines the two notnull masks with &:

print(df[pd.to_numeric(df['B'], errors='coerce').notnull()
         & pd.to_numeric(df['C'], errors='coerce').notnull()])

   A   B   C        D
0  1  96  12   apples
2  3  45  15  peaches
4  5   8  42    pears
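A row should survive only if both columns parse, which is a logical AND of the two per-column checks. A minimal sketch comparing & with an XOR-based mask, using a hypothetical row (index 2, not part of the question's data) in which both B and C are invalid:

```python
import pandas as pd

df = pd.DataFrame({
    "B": [96, "", "bad", 8],
    "C": [12, 66, "Not measured", 42],
})

# Per-column validity: True where the value parses as a number.
b_ok = pd.to_numeric(df["B"], errors="coerce").notnull()
c_ok = pd.to_numeric(df["C"], errors="coerce").notnull()

both_ok = b_ok & c_ok   # keep only rows where both columns are numeric
xor_mask = b_ok ^ ~c_ok  # XOR of "B is valid" with "C is invalid"

print(pd.DataFrame({"both_ok": both_ok, "xor_mask": xor_mask}))
```

The two masks agree whenever at most one column is invalid in a row, but the XOR-based mask also keeps row 2, where neither value is numeric.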

Timings:

#len(df) = 5k
df = pd.concat([df]*1000).reset_index(drop=True)

In [611]: %timeit df[pd.to_numeric(df['B'], errors='coerce').notnull() ^ pd.to_numeric(df['C'], errors='coerce').isnull()]
1000 loops, best of 3: 1.88 ms per loop

In [612]: %timeit df[df['B'].str.isdigit().isnull() ^ df['C'].str.isdigit().notnull()]
100 loops, best of 3: 16.1 ms per loop

In [613]: %timeit df[df[['B','C']].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)]
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 3.49 ms per loop
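Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper names `per_column` and `with_apply` are hypothetical, and the numbers will vary with the machine and pandas version, so no expected timings are shown:

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [96, 33, 45, "", 8],
    "C": [12, "Not measured", 15, 66, 42],
    "D": ["apples", "oranges", "peaches", "plums", "pears"],
})
big = pd.concat([small] * 1000, ignore_index=True)  # 5,000 rows


def per_column(df):
    # to_numeric on each column separately, masks combined with &
    b_ok = pd.to_numeric(df["B"], errors="coerce").notnull()
    c_ok = pd.to_numeric(df["C"], errors="coerce").notnull()
    return df[b_ok & c_ok]


def with_apply(df):
    # to_numeric applied to the two-column subset
    ok = df[["B", "C"]].apply(pd.to_numeric, errors="coerce").notnull().all(axis=1)
    return df[ok]


timings = {}
for func in (per_column, with_apply):
    # Best of 3 repeats, 20 calls each, reported as seconds per call.
    best = min(timeit.repeat(lambda: func(big), number=20, repeat=3))
    timings[func.__name__] = best / 20

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms per call")
```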

edited Apr 6, 2016 at 5:59

answered Apr 6, 2016 at 5:28


1 Comment

ZSG Over a year ago

Thanks! I like the first solution with apply, notnull, for maintainability. It seems to work! I'll give it a day and see if there are any problems that pop up, or if someone responds with an even simpler solution.
