
I have a dataframe with several columns, and I selected some of them to create a variable like this:

xtrain = df[['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title']]

I want to drop from these columns all rows where the Survive column in the main dataframe is NaN.


3 Answers


You can pass a boolean mask to your df based on notnull() of the 'Survive' column and select the columns of interest:

In [2]:
# make some data
df = pd.DataFrame(np.random.randn(5,7), columns= ['Survive', 'Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ])
df.loc[2, 'Survive'] = np.nan
df
Out[2]:
    Survive       Age      Fare  Group_Size      deck    Pclass     Title
0  1.174206 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  0.036843  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
2       NaN -0.132394 -0.236904   -0.324087  0.570660  0.758084 -0.176421
3 -2.145934 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.197144 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

Now pass the mask to loc to take only the non-NaN rows:

In [3]:
xtrain = df.loc[df['Survive'].notnull(), ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
xtrain

Out[3]:
        Age      Fare  Group_Size      deck    Pclass     Title
0 -0.056846  0.454437    0.496695  1.401509 -2.078731 -1.024832
1  1.060134  0.770625   -0.114912  0.118991 -0.317909  0.061022
3 -0.020003 -0.777785    0.835467  1.498284 -1.371325  0.661991
4 -0.089806 -0.706548    1.621260  1.754292  0.725897  0.860482

2 Comments

I just wish to know why the 'Survive' column is completely missing from the output. The question asks for dropping all rows that have NaNs, not for dropping entire columns that may have one or more NaNs.
@MuneshSingh the original question asked for an output with these columns ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title'] (the OP explained the "Survive" column was in the original data, but not requested in the output). The "Survive" column is not included in the output because it is not in the column indexer list in the .loc call, i.e. df.loc[row_indexer, column_indexer]. See pandas.pydata.org/pandas-docs/stable/user_guide/… for a complete explanation.
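To make the row-indexer/column-indexer split concrete, here is a minimal sketch (the data is made up for illustration): the boolean Series picks the rows, and the list picks the columns, so 'Survive' can be used for filtering without appearing in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survive': [1.0, np.nan, 0.0],
                   'Age': [22, 35, 58]})

# df.loc[row_indexer, column_indexer]: the mask selects rows,
# the list selects columns -- 'Survive' is filtered on but not returned.
out = df.loc[df['Survive'].notnull(), ['Age']]
print(out)
```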

Two alternatives because... well, why not?
Both drop NaN rows prior to column slicing. That's two calls rather than EdChum's one.

one

df.dropna(subset=['Survive'])[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]

two

df.query('Survive == Survive')[
    ['Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]]
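The query version works because of a quirk worth spelling out: by IEEE 754 rules, NaN is not equal to itself, so `Survive == Survive` evaluates to True exactly on the non-NaN rows. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survive': [1.0, np.nan, 0.0],
                   'Age': [22, 35, 58]})

# NaN != NaN, so self-comparison is False only for missing values.
assert np.nan != np.nan

kept = df.query('Survive == Survive')
print(kept)
```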

1 Comment

df.dropna(subset=['Survive'])[['Survive','Age','Fare', 'Group_Size','deck', 'Pclass', 'Title' ]] will retain the 'Survive' column too.

It might be more readable if you assign the mask and the subset of columns to variables and then filter.

notna_msk = df['Survive'].notna()
cols = ['Age', 'Fare', 'Group_Size', 'deck', 'Pclass', 'Title', 'Survive']
new_df = df.loc[notna_msk, cols]

Also, if you already created xtrain from df as in the OP, you can still filter that dataframe with the mask, even though it doesn't have the Survive column; the shared index is enough.

new_df = xtrain.loc[df['Survive'].notna()]
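A minimal sketch of that index-alignment point (hypothetical data): the mask is built from df, but .loc matches it to xtrain by index label, so the rows that are NaN in df['Survive'] are dropped from xtrain even though xtrain never contained that column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survive': [1.0, np.nan, 0.0],
                   'Age': [22, 35, 58],
                   'Fare': [7.25, 71.28, 8.05]})

xtrain = df[['Age', 'Fare']]                # no 'Survive' column here
new_df = xtrain.loc[df['Survive'].notna()]  # mask aligns on the shared index
print(new_df)
```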

