Dropping columns with high missing values

Question

I need to drop the columns of my dataframe that have a high share of missing values. I have created a second dataframe that gives the missing count and the missing ratio for each column of my original data set.

My original data set, data_merge2, looks like this:

A     B      C      D
123   ABC    X      Y
123   ABC    X      Y
NaN   ABC    NaN   NaN
123   ABC    NaN   NaN
245   ABC    NaN   NaN
345   ABC    NaN   NaN

The count data set, which gives the missing count and ratio per column, looks like this:

     missing_count   missing_ratio
  C    4               0.10
  D    4               0.66

The code I used to create the count data set:

# Only keep the columns that have missing values, since there are a lot of columns
new_df = (data_merge2.isna()
        .sum()
        .to_frame('missing_count')
        .assign(missing_ratio = lambda x: x['missing_count']/len(data_merge2)*100)
        .loc[data_merge2.isna().any()] )
print(new_df)

Now I want to drop the columns from the original dataframe whose missing ratio is greater than 50%. How can I achieve this?

ansev · Accepted Answer · 2020-01-23 16:32:02Z

4

Use:

data_merge2.loc[:,data_merge2.count().div(len(data_merge2)).ge(0.5)]
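# Alternative: keeps columns that are more than 50% non-missing,
# so a column with exactly 50% missing is dropped as well
# data_merge2[data_merge2.columns[data_merge2.count().mul(2).gt(len(data_merge2))]]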

Or use DataFrame.drop with the new_df DataFrame (its missing_ratio column is in percent):

data_merge2.drop(columns = new_df.index[new_df['missing_ratio'].gt(50)])

Output

       A    B
0  123.0  ABC
1  123.0  ABC
2    NaN  ABC
3  123.0  ABC
4  245.0  ABC
5  345.0  ABC
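
For a self-contained check, the sketch below rebuilds the sample frame from the question and applies the same 50% rule with a boolean mask on the per-column missing fraction. The names missing_fraction, mask_keep, trimmed and trimmed_alt are illustrative, and the dropna(thresh=...) variant is a different route to the same result rather than part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
data_merge2 = pd.DataFrame({
    "A": [123, 123, np.nan, 123, 245, 345],
    "B": ["ABC"] * 6,
    "C": ["X", "X", np.nan, np.nan, np.nan, np.nan],
    "D": ["Y", "Y", np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_fraction = data_merge2.isna().mean()

# Keep the columns whose missing fraction is at most 50%.
mask_keep = missing_fraction.le(0.5)
trimmed = data_merge2.loc[:, mask_keep]

# Alternative: dropna(axis=1, thresh=n) keeps the columns that have
# at least n non-missing values.
min_non_missing = int(np.ceil(0.5 * len(data_merge2)))
trimmed_alt = data_merge2.dropna(axis=1, thresh=min_non_missing)

print(trimmed.columns.tolist())  # ['A', 'B']
```

Both variants return new DataFrames and leave data_merge2 unchanged, so the result has to be assigned back if the original name should hold the trimmed frame.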

answered Jan 23, 2020 at 16:14

anky · Accepted Answer · 2020-01-23 17:28:58Z

3

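Another way, with query and XOR (this relies on ^ acting as a set symmetric difference between Index objects; recent pandas versions no longer treat it that way, and Index.symmetric_difference is the named equivalent):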

data_merge2[data_merge2.columns ^ new_df.query('missing_ratio>50').index]

Or the pandas way, using Index.difference:

data_merge2[data_merge2.columns.difference(new_df.query('missing_ratio>50').index)]

       A    B
0  123.0  ABC
1  123.0  ABC
2    NaN  ABC
3  123.0  ABC
4  245.0  ABC
5  345.0  ABC
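
One thing to keep in mind with Index.difference is that it sorts its result by default, so the surviving columns can come back in a different order than in the original frame. The sketch below uses a made-up frame (the column names z, a, m and b are hypothetical, chosen only to make the reordering visible) and compares it with a boolean mask built from Index.isin, which preserves the original order.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose columns are deliberately not in alphabetical order.
df = pd.DataFrame({
    "z": [1, 2, 3, 4],
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "m": ["p", "q", "r", "s"],
    "b": [5, 6, 7, 8],
})

# Missing ratio in percent, as in the question's new_df.
missing_ratio = df.isna().mean() * 100
to_drop = missing_ratio.index[missing_ratio.gt(50)]

# Index.difference returns the remaining labels sorted.
by_difference = df[df.columns.difference(to_drop)]

# A boolean mask keeps the remaining columns in their original order.
by_mask = df.loc[:, ~df.columns.isin(to_drop)]

print(by_difference.columns.tolist())  # ['b', 'm', 'z']
print(by_mask.columns.tolist())        # ['z', 'm', 'b']
```

With the question's columns A and B the two orders happen to coincide, which is why the output above looks identical to the original layout.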

answered Jan 23, 2020 at 16:23
