
I have a dataframe with several columns, and I selected some of them to create a variable like this:

xtrain = df[['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]

I want to drop from these columns all rows where the Survive column in the main dataframe is NaN.


3 Answers


You can pass a boolean mask to your df based on notnull() of the 'Survive' column and select the columns of interest:

In [2]:
# make some data
df = pd.DataFrame(np.random.randn(5,7), columns= ['Survive', 'Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ])
df.loc[2, 'Survive'] = np.nan
df
Out[2]:
    Survive       Age      Fare  Group_Size      deck    Pclass     Title
0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

Now pass the mask to loc to take only the non-NaN rows:

In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
xtrain

Out[3]:
        Age      Fare  Group_Size      deck    Pclass     Title
0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

2 Comments

I just wish to know why the 'Survive' column is completely missing from the output. The question asks for dropping all rows that have NaNs, not for dropping entire columns that may have one or more NaNs.
@MuneshSingh the original question asked for an output with these columns ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'] (the OP explained the "Survive" column was in the original data, but not requested in the output). The "Survive" column is not included in the output because it is not in the column indexer list in the .loc call, i.e. df.loc[row_indexer, column_indexer]. See pandas.pydata.org/pandas-docs/stable/user_guide/… for a complete explanation.
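To make the row-indexer/column-indexer split concrete, here is a minimal sketch (the data is made up for illustration): the boolean Series picks the rows, and the list picks the columns, so 'Survive' can be used for filtering without appearing in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survive': [1.0, np.nan, 0.0],
                   'Age': [22, 35, 58]})

# df.loc[row_indexer, column_indexer]: the mask selects rows,
# the list selects columns -- 'Survive' is filtered on but not returned.
out = df.loc[df['Survive'].notnull(), ['Age']]
print(out)
```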

Two alternatives because... well, why not?
Both drop NaN rows prior to column slicing. That's two calls rather than EdChum's one.

one

df.dropna(subset=['Survive'])[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]

two

df.query('Survive == Survive')[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
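The query version works because of a quirk worth spelling out: by IEEE 754 rules, NaN is not equal to itself, so `Survive == Survive` evaluates to True exactly on the non-NaN rows. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survive': [1.0, np.nan, 0.0],
                   'Age': [22, 35, 58]})

# NaN != NaN, so self-comparison is False only for missing values.
assert np.nan != np.nan

kept = df.query('Survive == Survive')
print(kept)
```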

1 Comment

df.dropna(subset=['Survive'])[['Survive','Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]] will retain the 'Survive' column too.

It might be more readable if you assign the mask and the subset of columns to variables and then filter.

notna_msk = df['Survive'].notna()
cols = ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title', 'Survive']
new_df = df.loc[notna_msk, cols]

Also, if you already created xtrain from df as in the OP, you can still filter that dataframe with the mask, even though it doesn't have the Survive column; the shared index is enough.

new_df = xtrain.loc[df['Survive'].notna()]
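A minimal sketch of that index-alignment point (hypothetical data): the mask is built from df, but .loc matches it to xtrain by index label, so the rows that are NaN in df['Survive'] are dropped from xtrain even though xtrain never contained that column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survive': [1.0, np.nan, 0.0],
                   'Age': [22, 35, 58],
                   'Fare': [7.25, 71.28, 8.05]})

xtrain = df[['Age', 'Fare']]                # no 'Survive' column here
new_df = xtrain.loc[df['Survive'].notna()]  # mask aligns on the shared index
print(new_df)
```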

