
I have a dataframe with some columns containing NaN values. I'd like to drop the columns that contain a certain number of NaN. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement this?

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan

print(dff)

6 Answers


There is a thresh param for dropna; you just need to pass the length of your df minus the number of NaN values you want to allow as the threshold:

In [13]:

dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
          A         B
0  0.517199 -0.806304
1 -0.643074  0.229602
2  0.656728  0.535155
3       NaN -0.162345
4 -0.309663 -0.783539
5  1.244725 -0.274514
6 -0.254232       NaN
7 -1.242430  0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416

So the above drops any column that does not have at least len(dff) - 2 non-NaN values, i.e. the number of rows minus 2.
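Note that thresh is the minimum number of non-NaN values a column needs in order to be kept, so thresh=len(dff) - 2 tolerates up to 2 NaN per column. If you want the strict "2 or more NaN means drop" behaviour from the question, a small sketch using the dff defined above (n_nan is just an illustrative name for the cut-off):

# Keep columns with at most n_nan - 1 NaN, i.e. drop columns with n_nan or more NaN.
n_nan = 2
dff.dropna(axis=1, thresh=len(dff) - n_nan + 1)  # keeps 'A' and 'B' in this example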


3 Comments

Thanks. A typo in your code: len(df) should be len(dff).
Do you know if it's possible to apply thresh except for a subset of specific columns? Thank you.
@pceccon Sorry, but I tend not to answer questions in comments as it lacks clarity. You can subset columns by passing a list, so IIUC you can do something like df[col_name_list].fillna(...) to apply the thresh only to this subset, and df[df.columns.difference(col_name_list)].fillna(...) for the other columns.
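One way to sketch that idea, assuming you want dropna's thresh applied only to a subset of columns (col_name_list is a placeholder, and column order may change after the concat):

col_name_list = ['A', 'B']  # placeholder: the columns the threshold should apply to
subset = dff[col_name_list].dropna(axis=1, thresh=len(dff) - 1)
rest = dff[dff.columns.difference(col_name_list)]
result = pd.concat([subset, rest], axis=1)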

You can use a conditional list comprehension:

>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
          A         B
0 -0.819004  0.919190
1  0.922164  0.088111
2  0.188150  0.847099
3       NaN -0.053563
4  1.327250 -0.376076
5  3.724980  0.292757
6 -0.319342       NaN
7 -1.051529  0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026

2 Comments

Is that an efficient implementation, CPU-wise?
Trivial for most use cases, but the answer depends on the size of your dataframes. The accepted answer is about 40% faster on my machine using a dataframe of 1M rows with 3 columns.
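If you want to reproduce that comparison yourself, a rough sketch (timings will vary with machine and pandas version):

from timeit import timeit
import numpy as np
import pandas as pd

# 1M-row frame with a couple of NaN in column 'C', roughly matching the comment above.
big = pd.DataFrame(np.random.randn(1_000_000, 3), columns=list('ABC'))
big.iloc[:2, 2] = np.nan

print(timeit(lambda: big.dropna(thresh=len(big) - 2, axis=1), number=10))
print(timeit(lambda: big[[c for c in big if big[c].isnull().sum() < 2]], number=10))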

Say you have to drop columns having more than 70% null values.

data.drop(data.loc[:, list(100 * data.isnull().sum() / len(data.index) > 70)].columns, axis=1)
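An arguably more readable equivalent, assuming the same 70% cut-off (isnull().mean() gives the fraction of NaN per column):

data = data.drop(columns=data.columns[data.isnull().mean() > 0.7])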



Here is a possible solution:

s = dff.isnull().sum(axis=0)  # count the number of NaN in each column
print(s)
A    1
B    1
C    3
dtype: int64

for col in list(dff):  # iterate over a copy of the column labels
    if s[col] >= 2:
        del dff[col]

Or

for c in list(dff):
    if dff[c].isnull().sum() >= 2:
        dff.drop(c, axis=1, inplace=True)



I recommend the drop method. This is an alternative solution:

dff.drop(dff.loc[:, dff.isnull().sum() >= 2].columns, axis=1)



Another approach for dropping columns having more than a certain number of NaN values:

df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])

For dropping columns having more than a certain percentage of NaN values:

df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])
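As a quick self-contained check of both variants (the toy frame and the cut-offs of 2 NaN and 20% are only for illustration):

import numpy as np
import pandas as pd

# Toy frame: column 'C' is 50% NaN, 'A' and 'B' are complete.
df = pd.DataFrame({'A': range(10),
                   'B': range(10),
                   'C': [np.nan] * 5 + list(range(5))})

by_count = df.drop(columns=[x for x in df if df[x].isna().sum() > 2])        # drop cols with > 2 NaN
by_pct = df.drop(columns=[x for x in df if df[x].isna().mean() * 100 > 20])  # drop cols with > 20% NaN

print(by_count.columns.tolist())  # ['A', 'B']
print(by_pct.columns.tolist())    # ['A', 'B']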

