
I have a dataframe with some columns containing NaN values. I'd like to drop the columns that contain a certain number of NaN. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement this?

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan

print(dff)

6 Answers


There is a thresh param for dropna; you just need to pass the length of your df minus the number of NaN values you want to allow as the threshold:

In [13]:

dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
          A         B
0  0.517199 -0.806304
1 -0.643074  0.229602
2  0.656728  0.535155
3       NaN -0.162345
4 -0.309663 -0.783539
5  1.244725 -0.274514
6 -0.254232       NaN
7 -1.242430  0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416

So the above drops any column that does not have at least len(dff) - 2 non-NaN values, i.e. the number of rows minus 2.
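Note that thresh is the minimum number of non-NaN values a column needs in order to be kept, so thresh=len(dff) - 2 tolerates up to 2 NaN per column. If you want the strict "2 or more NaN means drop" behaviour from the question, a small sketch using the dff defined above (n_nan is just an illustrative name for the cut-off):

# Keep columns with at most n_nan - 1 NaN, i.e. drop columns with n_nan or more NaN.
n_nan = 2
dff.dropna(axis=1, thresh=len(dff) - n_nan + 1)  # keeps 'A' and 'B' in this example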


3 Comments

Thanks. A typo in your code: len(df) should be len(dff).
Do you know if it's possible to apply thresh except for a subset of specific columns? Thank you.
@pceccon Sorry, but I tend not to answer questions in comments as it lacks clarity. You can subset columns by passing a list, so IIUC you can do something like df[col_name_list].fillna(...) to apply the thresh only to this subset, and df[df.columns.difference(col_name_list)].fillna(...) for the other columns.
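One way to sketch that idea, assuming you want dropna's thresh applied only to a subset of columns (col_name_list is a placeholder, and column order may change after the concat):

col_name_list = ['A', 'B']  # placeholder: the columns the threshold should apply to
subset = dff[col_name_list].dropna(axis=1, thresh=len(dff) - 1)
rest = dff[dff.columns.difference(col_name_list)]
result = pd.concat([subset, rest], axis=1)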

You can use a conditional list comprehension:

>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
          A         B
0 -0.819004  0.919190
1  0.922164  0.088111
2  0.188150  0.847099
3       NaN -0.053563
4  1.327250 -0.376076
5  3.724980  0.292757
6 -0.319342       NaN
7 -1.051529  0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026

2 Comments

Is that an efficient implementation, CPU-wise?
Trivial for most use cases, but the answer depends on the size of your dataframes. The accepted answer is about 40% faster on my machine using a dataframe of 1M rows with 3 columns.
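If you want to reproduce that comparison yourself, a rough sketch (timings will vary with machine and pandas version):

from timeit import timeit
import numpy as np
import pandas as pd

# 1M-row frame with a couple of NaN in column 'C', roughly matching the comment above.
big = pd.DataFrame(np.random.randn(1_000_000, 3), columns=list('ABC'))
big.iloc[:2, 2] = np.nan

print(timeit(lambda: big.dropna(thresh=len(big) - 2, axis=1), number=10))
print(timeit(lambda: big[[c for c in big if big[c].isnull().sum() < 2]], number=10))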

Say you have to drop columns having more than 70% null values.

data.drop(data.loc[:, list(100 * data.isnull().sum() / len(data.index) > 70)].columns, axis=1)
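An arguably more readable equivalent, assuming the same 70% cut-off (isnull().mean() gives the fraction of NaN per column):

data = data.drop(columns=data.columns[data.isnull().mean() > 0.7])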



Here is a possible solution:

s = dff.isnull().sum(axis=0)  # count the number of NaN in each column
print(s)
A    1
B    1
C    3
dtype: int64

for col in list(dff):  # iterate over a copy of the column labels
    if s[col] >= 2:
        del dff[col]

Or

for c in list(dff):
    if dff[c].isnull().sum() >= 2:
        dff.drop(c, axis=1, inplace=True)



I recommend the drop method. This is an alternative solution:

dff.drop(dff.loc[:, dff.isnull().sum() >= 2].columns, axis=1)



Another approach for dropping columns having more than a certain number of NaN values:

df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])

For dropping columns having more than a certain percentage of NaN values:

df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])
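As a quick self-contained check of both variants (the toy frame and the cut-offs of 2 NaN and 20% are only for illustration):

import numpy as np
import pandas as pd

# Toy frame: column 'C' is 50% NaN, 'A' and 'B' are complete.
df = pd.DataFrame({'A': range(10),
                   'B': range(10),
                   'C': [np.nan] * 5 + list(range(5))})

by_count = df.drop(columns=[x for x in df if df[x].isna().sum() > 2])        # drop cols with > 2 NaN
by_pct = df.drop(columns=[x for x in df if df[x].isna().mean() * 100 > 20])  # drop cols with > 20% NaN

print(by_count.columns.tolist())  # ['A', 'B']
print(by_pct.columns.tolist())    # ['A', 'B']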

