13

I'm working on a machine learning problem in which the features have many missing values. There are hundreds of features, and I would like to remove those that have too many missing values (for example, features with more than 80% missing values). How can I do that in Python?

My data is a Pandas dataframe.

0

9 Answers

29

Demo:

Setup:

In [105]: df = pd.DataFrame(np.random.choice([2,np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))

In [106]: df
Out[106]:
      a    b    c    d    e
0   NaN  2.0  NaN  NaN  NaN
1   NaN  NaN  2.0  NaN  2.0
2   NaN  2.0  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  2.0
4   NaN  2.0  2.0  NaN  NaN
5   NaN  NaN  NaN  NaN  NaN
6   NaN  2.0  NaN  NaN  NaN
7   2.0  2.0  NaN  NaN  NaN
8   2.0  2.0  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN
10  NaN  2.0  2.0  NaN  2.0
11  NaN  NaN  NaN  2.0  NaN
12  2.0  NaN  NaN  2.0  NaN
13  NaN  NaN  NaN  2.0  NaN
14  NaN  NaN  NaN  2.0  2.0
15  NaN  NaN  NaN  NaN  NaN
16  NaN  2.0  NaN  NaN  NaN
17  2.0  NaN  NaN  NaN  2.0
18  NaN  NaN  NaN  2.0  NaN
19  NaN  2.0  NaN  2.0  NaN

In [107]: df.isnull().mean()
Out[107]:
a    0.80
b    0.55
c    0.85
d    0.70
e    0.75
dtype: float64

Solution:

In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')

In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
      b    d    e
0   2.0  NaN  NaN
1   NaN  NaN  2.0
2   2.0  NaN  NaN
3   NaN  NaN  2.0
4   2.0  NaN  NaN
5   NaN  NaN  NaN
6   2.0  NaN  NaN
7   2.0  NaN  NaN
8   2.0  NaN  NaN
9   NaN  NaN  NaN
10  2.0  NaN  2.0
11  NaN  2.0  NaN
12  NaN  2.0  NaN
13  NaN  2.0  NaN
14  NaN  2.0  2.0
15  NaN  NaN  NaN
16  2.0  NaN  NaN
17  NaN  NaN  2.0
18  NaN  2.0  NaN
19  2.0  2.0  NaN

1 Comment

Great solution as always, +1. However, for readability I'd say it is better to show more columns than rows. I added a row filter as an answer too. (Or maybe it's just me, sitting at a laptop at the moment.)
20

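You can use Pandas' dropna(). Its thresh argument is the minimum number of non-NaN values a column must have to be kept, so dropping columns with more than 80% missing values means requiring at least 20% non-NaN values:

limitPer = len(yourdf) * .20
yourdf = yourdf.dropna(thresh=limitPer, axis=1)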
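
Because thresh counts non-missing values and is documented as an integer, it can help to make the conversion from a "maximum missing fraction" explicit. The following is only a sketch: the helper name drop_sparse_columns and its max_missing parameter are made up for illustration and are not part of pandas, and values that land exactly on the boundary are subject to ordinary floating-point rounding.

```python
import math

import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing=0.8):
    """Drop columns whose fraction of missing values exceeds max_missing.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Largest number of missing values a column may have and still be kept.
    max_missing_count = math.floor(len(frame) * max_missing)
    # dropna's thresh is the minimum number of NON-missing values required.
    min_non_null = len(frame) - max_missing_count
    return frame.dropna(axis=1, thresh=min_non_null)


# Small demo frame (invented data): 'a' is 100% missing, 'b' is 80% missing,
# 'c' has no missing values.
demo = pd.DataFrame({
    "a": [np.nan] * 5,
    "b": [1.0] + [np.nan] * 4,
    "c": [1.0] * 5,
})
result = drop_sparse_columns(demo, max_missing=0.8)
```

Here 'b' is kept because exactly 80% missing does not exceed the limit, while 'a' is dropped.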

1 Comment

The more pandas-style solution!
5

Following MaxU's example, this is the option for filtering rows:

    df = pd.DataFrame(np.random.choice([2,np.nan], (5,10), p=[0.2, 0.8]), columns=list('abcdefghij'))
        a    b    c    d    e    f    g    h    i    j
    0   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    3   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Rows

    df.loc[df.isnull().mean(axis=1).lt(0.8)]
        a    b    c    d    e    f    g    h    i    j
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN


4

To generalize within Pandas, you can compute the fraction of missing values in each column, select the features where more than 80% of the values are null, and then drop those columns from the DataFrame.

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)


4

Here is a simple function that you can use directly by passing a dataframe and a threshold (in percent):

def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Columns to keep: those below the threshold
    l = list(dff.columns[pct_missing < threshold])
    print("# Columns having at least %s percent missing values: " % threshold, (dff.shape[1] - len(l)))
    # Report the dropped columns in their original order
    print("Columns:\n", [c for c in dff.columns if c not in l])
    return l


rmissingvaluecol(df, 80) # Here the threshold is 80%, which means we are going to drop columns having 80% or more missing values

# Output
'''
# Columns having at least 80 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df, 49)
df1 = df[l]
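
If the removed columns are needed as well, one option is to return both lists and index the dataframe with each. This is a sketch rather than part of the answer above: split_columns_by_missing is a hypothetical name, and the demo data is invented.

```python
import numpy as np
import pandas as pd


def split_columns_by_missing(frame, threshold_pct):
    """Return (kept, removed) column name lists.

    Hypothetical helper: a column is removed when its percentage of
    missing values is greater than or equal to threshold_pct.
    """
    pct_missing = 100 * frame.isnull().mean()
    removed = pct_missing[pct_missing >= threshold_pct].index.tolist()
    kept = [col for col in frame.columns if col not in removed]
    return kept, removed


# Invented demo data: 'id' is missing in 5 of 6 rows, 'owner' in 1 of 6.
demo = pd.DataFrame({
    "id": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    "owner": ["ann", "bob", None, "dee", "eve", "fay"],
})

kept, removed = split_columns_by_missing(demo, 80)
df_kept = demo[kept]        # columns that survive the filter
df_removed = demo[removed]  # columns that were filtered out
```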

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

# Output
'''
id          83.33
location    83.33
owner       16.67
pets        16.67
dtype: float64
'''

3 Comments

Is there any way to store all of these in a dataframe?
It is there already: l = rmissingvaluecol(df, 49); df1 = df[l]
Thanks. That saves the columns that are left. Can I store the columns being removed too?
2

The fastest way to find the number of NaN values, or their fraction, in each column is:

  • for the sum: df.isna().sum()
  • for the percentage: df.isna().mean()
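
The mean works because isna() returns a boolean mask, and the mean of booleans is the fraction of True values. As a rough illustration (the column names n_missing and frac_missing and the demo data are made up here), both numbers can be collected into one summary table:

```python
import numpy as np
import pandas as pd

# Invented demo data: 'a' has 3 of 4 values missing, 'b' has 1 of 4.
demo = pd.DataFrame({
    "a": [np.nan, np.nan, 2.0, np.nan],
    "b": [2.0, np.nan, 2.0, 2.0],
})

# One row per original column: missing count and missing fraction,
# sorted so the sparsest columns come first.
summary = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "frac_missing": demo.isna().mean(),
}).sort_values("frac_missing", ascending=False)
```

Multiplying frac_missing by 100 gives a percentage in the usual sense.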


1
def show_null_columns(data, agg, threshold):
    # agg='sum' compares missing counts; agg='mean' compares missing fractions
    if agg == 'sum':
        null_cols = data.isnull().sum()
    elif agg == 'mean':
        null_cols = data.isnull().mean()
    else:
        raise ValueError("agg must be 'sum' or 'mean'")
    # Collect the columns whose aggregate reaches the threshold
    null_dic = {}
    for col, x in null_cols.items():
        if x >= threshold:
            null_dic[col] = x
    return null_dic

null_dic = show_null_columns(train, 'mean', 0.8)
train2 = train.drop(columns=list(null_dic))


0

Use:

df = df[df.isnull().sum(axis=1) <= 5]

This keeps only the rows with at most five missing values; rows with more than five missing values are dropped.


0

One thing to note about dropna(), according to the documentation: the thresh argument specifies the minimum number of non-NaN values a row or column must have to be kept, not the number of NaNs allowed.
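
A tiny example may make the counting direction concrete. The data below is invented for illustration; each column has a different number of non-NaN values.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 1 non-NaN value
    "b": [1.0, 2.0, np.nan, np.nan],     # 2 non-NaN values
    "c": [1.0, 2.0, 3.0, 4.0],           # 4 non-NaN values
})

# thresh=2 keeps columns with at least 2 non-NaN values: 'b' and 'c'.
at_least_two = demo.dropna(axis=1, thresh=2)

# thresh=3 keeps columns with at least 3 non-NaN values: only 'c'.
at_least_three = demo.dropna(axis=1, thresh=3)
```

Raising thresh therefore makes the filter stricter, which is the opposite of what a "maximum NaNs allowed" reading would suggest.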

6 Comments

Welcome to StackOverflow. This seems more like a comment than an answer. Please consider commenting on the answer you like the best.
I did try that, but I don't seem to have enough "reputation" yet. However, I think the point I mentioned can change the output of dropna().
Hang in there. It won't be long before you can make comments.
Thanks for the words of encouragement!