How to drop row in Dataframe if column is NaN and there is another row where the column is not NaN

Question

I have a pandas dataframe in python where the rows are identified by p1 & p2, but p2 is sometimes NaN:

   p1 p2
0  a  1
1  a  2
2  a  3
3  b  NaN
4  c  4
5  d  NaN
6  d  5

The above dataframe was returned from a larger one with many duplicates by using

df.drop_duplicates(subset=["p1","p2"], keep='last')

which works for the most part, the only issue being that NaN and 5 are technically not duplicates and therefore not dropped.
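To make the issue concrete, here is a minimal sketch that rebuilds the example frame by hand (the construction is an assumption; the original larger frame is not shown) and confirms that drop_duplicates leaves both "d" rows in place:

```python
import numpy as np
import pandas as pd

# The example frame from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "p1": ["a", "a", "a", "b", "c", "d", "d"],
    "p2": [1, 2, 3, np.nan, 4, np.nan, 5],
})

deduped = df.drop_duplicates(subset=["p1", "p2"], keep="last")

# ("d", NaN) and ("d", 5) differ in p2, so neither counts as a duplicate.
print(deduped[deduped["p1"] == "d"])
```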

How can I drop rows such as ("d", NaN), where another row has the same p1 and a non-null p2, e.g. ("d", 5)? The important thing is that ("b", NaN) is kept, because no row with p1 "b" has a non-null p2.

BENY · Accepted Answer · 2017-11-21 04:47:52Z

1

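We can group by p1, ffill and bfill p2 within each group, then drop_duplicates:

# transform returns a result aligned to df's index, which assign needs;
# on newer pandas versions apply may prepend the group key to the index instead
df.assign(p2=df.groupby('p1')['p2'].transform(lambda x: x.ffill().bfill())).\
      drop_duplicates(subset=["p1","p2"], keep='last')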
Out[645]: 
  p1   p2
0  a  1.0
1  a  2.0
2  a  3.0
3  b  NaN
4  c  4.0
6  d  5.0
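As a self-contained sketch of the fill-then-deduplicate idea, the following rebuilds the question's frame by hand (an assumption, since the original data is not available) and uses transform so that the filled column lines up with the frame's index:

```python
import numpy as np
import pandas as pd

# The example frame from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "p1": ["a", "a", "a", "b", "c", "d", "d"],
    "p2": [1, 2, 3, np.nan, 4, np.nan, 5],
})

# Within each p1 group, fill NaN from neighbouring rows (forward, then backward).
# A group that is entirely NaN, such as "b", stays NaN.
filled = df.groupby("p1")["p2"].transform(lambda s: s.ffill().bfill())

# After filling, ("d", NaN) has become ("d", 5) and is dropped as a duplicate.
result = df.assign(p2=filled).drop_duplicates(subset=["p1", "p2"], keep="last")
print(result)
```

Note that the surviving row is chosen by position (keep='last'), so with a different row order it could be the originally-NaN row carrying the filled value rather than the original non-null row.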


Sebastian Mendez · Accepted Answer · 2017-11-21 05:16:17Z

1

The rows to drop are those with a NaN p2 whose p1 also appears on another row (the intersection of the NaN rows and the rows with a duplicated p1), together with the rows that are duplicated across both columns:

dupe_1 = df['p1'].duplicated(keep=False) & df['p2'].isnull()
dupe_2 = df.duplicated(subset=['p1','p2'])
total_dupes = dupe_1 | dupe_2
new_df = df[~total_dupes]

Note that this will fail for a dataframe such as:

  p1  p2
0  a NaN
1  a NaN

Here both rows would be removed. We must therefore first run df.drop_duplicates(subset=['p1','p2'], inplace=True, keep='last'), which leaves only one of those rows, after which the solution above works correctly.
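The two steps can be sketched together as follows. The sample frame is made up for illustration and includes an all-NaN group ("a") to exercise the edge case described above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "a" has only NaN rows, "d" has a NaN and a value.
df = pd.DataFrame({
    "p1": ["a", "a", "b", "d", "d"],
    "p2": [np.nan, np.nan, np.nan, np.nan, 5],
})

# Step 1: collapse exact duplicates first. pandas treats NaN as equal to NaN
# here, so the all-NaN group "a" shrinks to a single row.
deduped = df.drop_duplicates(subset=["p1", "p2"], keep="last")

# Step 2: drop NaN rows whose p1 still appears on more than one row.
nan_with_sibling = deduped["p1"].duplicated(keep=False) & deduped["p2"].isnull()
new_df = deduped[~nan_with_sibling]
print(new_df)
```

Unlike the fill-based approach, this keeps the original p2 values untouched and only removes rows.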

edited Nov 21, 2017 at 5:16

answered Nov 21, 2017 at 4:50


2 Comments

Jesse Over a year ago

df.drop_duplicates(subset=["p1","p2"], keep='last') should remove all of those cases, so as long as I apply your answer afterward it would work.

Sebastian Mendez Over a year ago

Ah, excellent point, that would definitely fix that problem. I'll edit my answer to include that.
