
I have a dataframe with some NaNs:

hostname   period     Teff
51 Peg       4.2293   5773
51 Peg       4.231     NaN
51 Peg       4.23077   NaN
55 Cnc      44.3787    NaN
55 Cnc      44.373     NaN
55 Cnc      44.4175    NaN
55 Cnc          NaN   5234
61 Vir          NaN   5577
61 Vir      38.021     NaN
61 Vir     123.01      NaN

The rows with the same "hostname" all refer to the same object, but as you can see, some entries have NaNs under various columns. I'd like to merge all the rows under the same hostname such that I retain the first finite value in each column (drop the row if all values are NaN). So the result should look like this:

hostname   period   Teff
51 Peg     4.2293   5773
55 Cnc    44.3787   5234
61 Vir    38.021    5577

How would you go about doing this?

  • You can simply use df3 = df1.combine_first(df2) which is designed to do exactly this operation. Commented Mar 17, 2023 at 14:37
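For context, combine_first fills NaNs in one frame with values from another at matching index/column labels, so it fits best when the data is split across two dataframes rather than duplicated rows in one. A minimal sketch (df1 and df2 here are illustrative stand-ins, not names from the question):

```python
import pandas as pd

# Illustrative frames: combine_first patches NaNs in df1 with the
# values from df2 at the same index/column positions.
df1 = pd.DataFrame({'period': [4.2293, None], 'Teff': [None, 5234.0]},
                   index=['51 Peg', '55 Cnc'])
df2 = pd.DataFrame({'period': [None, 44.3787], 'Teff': [5773.0, None]},
                   index=['51 Peg', '55 Cnc'])

df3 = df1.combine_first(df2)
# df3 has no NaNs left: 51 Peg -> (4.2293, 5773.0), 55 Cnc -> (44.3787, 5234.0)
```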

2 Answers


Use groupby.first; it takes the first non-NA value in each column per group:

df.groupby('hostname')[['period', 'Teff']].first().reset_index()
#  hostname   period    Teff
#0   51 Peg   4.2293  5773.0
#1   55 Cnc  44.3787  5234.0
#2   61 Vir  38.0210  5577.0

Or manually do this with a custom aggregation function:

df.groupby('hostname')[['period', 'Teff']].agg(lambda x: x.dropna().iat[0]).reset_index()

This requires that each group has at least one non-NA value in every column; on an all-NaN group, dropna() returns an empty series and .iat[0] raises an IndexError.
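A quick sketch of that failure mode, using a hypothetical group whose Teff values are all NaN:

```python
import pandas as pd

# Hypothetical frame where one group has no Teff value at all
df_bad = pd.DataFrame({'hostname': ['51 Peg', '51 Peg'],
                       'period': [4.2293, 4.231],
                       'Teff': [float('nan'), float('nan')]})

# .iat[0] on the empty result of dropna() raises an IndexError
try:
    df_bad.groupby('hostname')['Teff'].agg(lambda x: x.dropna().iat[0])
except IndexError:
    print('IndexError: no non-NA value in the group')
```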

Write your own function to handle the edge case:

import numpy as np

def first_(g):
    non_na = g.dropna()
    return non_na.iat[0] if len(non_na) > 0 else np.nan

df.groupby('hostname')[['period', 'Teff']].agg(first_).reset_index()

#  hostname   period    Teff
#0   51 Peg   4.2293  5773.0
#1   55 Cnc  44.3787  5234.0
#2   61 Vir  38.0210  5577.0

3 Comments

Both of your solutions work well, except they also drop all the other columns (if my dataframe has columns in addition to the three named here).
If you want to take the first value from all columns, you should be able to simply do df.groupby('hostname').first().reset_index() without selecting columns.
I see. That does what I need it to do, and is short and simple. Thanks!
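A sketch of what that comment describes (the extra 'mass' column is hypothetical, just to show that unselected columns survive):

```python
import pandas as pd

# Hypothetical frame with a column beyond period/Teff
df = pd.DataFrame({'hostname': ['61 Vir', '61 Vir'],
                   'period': [float('nan'), 38.021],
                   'Teff': [5577.0, float('nan')],
                   'mass': [float('nan'), 0.94]})

# Without a column selection, first() takes the first non-NA value
# per group in every column, including 'mass'
res = df.groupby('hostname').first().reset_index()
# res: one row, 61 Vir -> period 38.021, Teff 5577.0, mass 0.94
```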

Is this what you need?

pd.concat([df1.apply(lambda x: sorted(x, key=pd.isnull)) for _, df1 in df.groupby('hostname')]).dropna()
Out[343]: 
  hostname   period    Teff
0   51 Peg   4.2293  5773.0
3   55 Cnc  44.3787  5234.0
7   61 Vir  38.0210  5577.0
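The core trick here is that sorted(..., key=pd.isnull) is a stable sort on False/True, so within each group it shifts the NaNs to the end of every column while preserving the order of the real values; a minimal sketch on one column:

```python
import pandas as pd

# key=pd.isnull maps real values to False (0) and NaNs to True (1), so
# Python's stable sort moves NaNs to the end without reordering the rest.
vals = [float('nan'), 44.3787, float('nan'), 5234.0]
print(sorted(vals, key=pd.isnull))  # [44.3787, 5234.0, nan, nan]
```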

2 Comments

Thank you. This appears to work perfectly with one small modification to only drop NaNs in selected columns: pd.concat([ df1.apply(lambda x: sorted(x, key=pd.isnull)) for _, df1 in df.groupby('hostname')]).dropna(subset=['period','Teff'])
@mcglashan aha :-) glad it helped
