
I am doing data preprocessing and want to remove features/columns that have more than, say, 10% missing values.

I have written the code below:

df_missing=df.isna()
result=df_missing.sum()/len(df)
result

Default           0.010066
Income            0.142857
Age               0.109090
Name              0.047000
Gender            0.000000
Type of job       0.200000
Amt of credit     0.850090
Years employed    0.009003
dtype: float64

I want df to keep only the columns with no more than 10% missing values.

Expected output:

df

Default   Name   Gender   Years employed

(columns where there were missing values greater than 10% are removed.)

I have tried

result.iloc[:,0] 
IndexingError: Too many indexers

Please help

3 Answers

4

Because dividing the sum by the length gives the mean, you can use df_missing.mean() instead of df_missing.sum()/len(df):

result = df.isna().mean()

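Then filter with DataFrame.loc, using : to select all rows and a boolean mask to select the columns. The mask must be True for the columns you want to keep, that is, those with at most 10% missing values:

df = df.loc[:, result <= .1]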
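As a minimal sketch of the mask-based filter, the snippet below keeps only the columns whose share of missing values is at most 10%, which is what the question asks for. The toy DataFrame and the names missing_share and kept are made up for illustration; only the column names are borrowed from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 10 rows (values are invented for illustration).
df = pd.DataFrame({
    "Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],                  # 0% missing
    "Income": [np.nan, np.nan, 3, 4, 5, 6, 7, 8, 9, 10],         # 20% missing
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],  # 0% missing
    "Amt of credit": [1.0, 2.0] + [np.nan] * 8,                  # 80% missing
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = df.isna().mean()

# Keep all rows (:) and only the columns where the mask is True.
kept = df.loc[:, missing_share <= 0.1]
print(list(kept.columns))
```

The comparison produces a boolean Series indexed by column name, so .loc aligns it with the columns of df.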

2 Comments

Why did you use .mean() in df.isna().mean()?
@ShailajaGuptaKapoor - dividing the sum by the length is the mean ;)
1

It should be df = df.loc[:, result < .1], as the user only wants to keep the columns that have less than 10% of their rows missing.

0

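pandas has a built-in method for this, DataFrame.dropna. Note that thresh is the minimum number of non-missing values a column needs in order to be kept, so to drop columns with more than 10% missing values you have to require at least 90% non-missing values (this needs import math):

df_clean = df.dropna(axis=1, thresh=math.ceil(len(df) * .9))

Or, if you don't want to create an extra DataFrame object, you can do it in place:

df.dropna(axis=1, thresh=math.ceil(len(df) * .9), inplace=True)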
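A short sketch of the dropna route may help here. The thresh argument is the minimum number of non-missing values required to keep a column, so a 10% cap on missing values corresponds to requiring 90% of the rows to be present. The toy DataFrame and the name min_non_missing are hypothetical.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data with 10 rows (values are invented for illustration).
df = pd.DataFrame({
    "Default": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],                  # 10 non-missing
    "Income": [np.nan, np.nan, 3, 4, 5, 6, 7, 8, 9, 10],         # 8 non-missing
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],  # 10 non-missing
    "Amt of credit": [1.0, 2.0] + [np.nan] * 8,                  # 2 non-missing
})

# thresh counts NON-missing values, so allow at most 10% missing
# by requiring at least 90% of the rows to be present.
min_non_missing = math.ceil(0.9 * len(df))

# axis=1 drops columns instead of rows; a new DataFrame is returned.
df_clean = df.dropna(axis=1, thresh=min_non_missing)
print(list(df_clean.columns))
```

Passing an integer to thresh avoids relying on how a float threshold is handled across pandas versions.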
