Remove columns with missing values above a threshold in pandas

Question

I am preprocessing data and want to remove the features (columns) in which more than, say, 10% of the values are missing.

I have written the code below:

df_missing=df.isna()
result=df_missing.sum()/len(df)
result

Default           0.010066
Income            0.142857
Age               0.109090
Name              0.047000
Gender            0.000000
Type of job       0.200000
Amt of credit     0.850090
Years employed    0.009003
dtype: float64

I want df to keep only the columns in which at most 10% of the values are missing.

Expected output:

df

Default   Name   Gender   Years employed

(columns where there were missing values greater than 10% are removed.)

I have tried the following, but it raises an error:

result.iloc[:,0] 
IndexingError: Too many indexers

Please help

jezrael · Accepted Answer · 2020-02-28 11:44:01Z

4

Because dividing the sum by the length gives the mean, you can replace df_missing.sum()/len(df) with df_missing.mean():

result = df.isna().mean()

Then filter with DataFrame.loc, using : to select all rows and a boolean mask to select the columns to keep (those with less than 10% missing values):

df = df.loc[:, result < .1]
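
As a minimal sketch of the two steps above, here is a small worked example. The sample values are made up for illustration (only the column names come from the question), and the mask keeps the columns whose missing fraction is below 10%, which is what the question asks for:

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row sample; only the column names are taken from the question.
df = pd.DataFrame({
    "Default": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],                 # no missing values
    "Income": [50.0, np.nan, np.nan, 42.0, 38.0,
               61.0, 55.0, 47.0, 39.0, 58.0],                  # 2 of 10 missing
    "Gender": list("MFMFMFMFMF"),                              # no missing values
    "Amt of credit": [np.nan] * 8 + [1000.0, 2500.0],          # 8 of 10 missing
})

# Fraction of missing values per column: the mean of the boolean isna() mask.
result = df.isna().mean()

# Keep all rows (:) and only the columns whose missing fraction is below 10%.
df_kept = df.loc[:, result < 0.1]

print(result)
print(df_kept.columns.tolist())
```

Assigning to a new name (df_kept here, a name chosen for this sketch) leaves the original df untouched, which makes it easy to check which columns were dropped.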

edited Feb 28, 2020 at 11:44

answered Feb 28, 2020 at 11:30

jezrael

2 Comments

noob Over a year ago

Why did you use .mean() in df.isna().mean()?

jezrael Over a year ago

@ShailajaGuptaKapoor - dividing the sum by the length gives the mean ;)

Syscall · Accepted Answer · 2021-03-05 14:33:06Z

1

The filter should be df = df.loc[:, result < .1], because the user only wants to keep the columns in which less than 10% of the rows are missing.
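
Whether a column sitting exactly at the threshold survives depends on the comparison operator. A small sketch, using a hypothetical two-column frame invented for this illustration, comparing < with <=:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "a" has exactly 1 of 10 values missing, "b" has none.
df = pd.DataFrame({
    "a": [np.nan] + [1.0] * 9,
    "b": [1.0] * 10,
})

# Fraction of missing values per column.
result = df.isna().mean()

# Strict comparison drops a column that is exactly at the 10% limit.
strict = df.loc[:, result < 0.1].columns.tolist()

# Inclusive comparison keeps it ("remove columns with MORE than 10% missing").
inclusive = df.loc[:, result <= 0.1].columns.tolist()

print(strict)
print(inclusive)
```

The comparison is made on floating-point fractions, so for thresholds that are not exactly representable it may be safer to compare missing counts (df.isna().sum()) against an integer limit instead.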

edited Mar 5, 2021 at 14:33

Syscall

answered Mar 5, 2021 at 14:25

Unknown

ghost_in_the · Accepted Answer · 2022-12-02 22:15:35Z

0

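pandas has a built-in method for this. Note that thresh is the minimum number of non-missing values a column needs in order to be kept, so for a 10% missing limit it has to be about 90% of the row count, not 10%:

min_non_na = len(df) - int(len(df) * .1)  # at most 10% of the rows may be missing
df_clean = df.dropna(axis=1, thresh=min_non_na)

Or, if you don't want to create an extra DataFrame object, you can do it in place:

df.dropna(axis=1, thresh=min_non_na, inplace=True)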
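
A minimal sketch of how thresh behaves, using a hypothetical sample frame made up for this illustration. In DataFrame.dropna, thresh is the minimum number of non-missing values required to keep a column, so a 10% missing limit corresponds to requiring roughly 90% non-missing values, while thresh=len(df)*.1 only requires 10% non-missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical 10-row frame with different amounts of missing data per column.
df = pd.DataFrame({
    "full": [1.0] * 10,                         # 10 non-missing values
    "one_gap": [np.nan] + [1.0] * 9,            # 9 non-missing (10% missing)
    "two_gaps": [np.nan] * 2 + [1.0] * 8,       # 8 non-missing (20% missing)
    "mostly_empty": [np.nan] * 8 + [1.0] * 2,   # 2 non-missing (80% missing)
})

max_missing = 0.1  # assumed limit: at most 10% missing values per column

# Convert the missing-fraction limit into a minimum non-missing count.
allowed_missing = int(len(df) * max_missing)
min_non_na = len(df) - allowed_missing

# Keeps only columns with at least min_non_na non-missing values.
df_clean = df.dropna(axis=1, thresh=min_non_na)

# For comparison: requiring only 10% non-missing values keeps far more columns.
df_loose = df.dropna(axis=1, thresh=int(len(df) * 0.1))

print(df_clean.columns.tolist())
print(df_loose.columns.tolist())
```

Computing the count from the number of allowed missing cells keeps the threshold an integer, which is what thresh expects.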

answered Dec 2, 2022 at 22:15

ghost_in_the
