
Suppose I have a dataframe with rows containing missing data, but a set of columns acting as a key:

import pandas as pd
import numpy as np
data = {
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"],
    "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan],
    "weight": [np.nan, np.nan, np.nan, 72, np.nan, 68, np.nan, 70],
}

df = pd.DataFrame.from_dict(data)
print(df)


   id    name  height  weight
0   1    John   178.0     NaN
1   1    John     NaN     NaN
2   2    Paul   182.0     NaN
3   2    Paul     NaN    72.0
4   3   Ringo   175.0     NaN
5   3   Ringo     NaN    68.0
6   4  George   188.0     NaN
7   4  George     NaN    70.0

How would I go about "squashing" these rows with duplicate keys down to pick the non-nan value (if it exists)?

Desired output:

   id    name  height  weight
0   1    John   178.0     NaN
2   2    Paul   182.0    72.0
4   3   Ringo   175.0    68.0
6   4  George   188.0    70.0

The index doesn't matter, and each group has at most one row with a non-NaN value in each column. I think I need to use groupby(['id', 'name']), but I'm not sure where to go from there.

1 Answer


If there is always at most one non-NaN value per group and column, it is possible to aggregate in several ways:

df = df.groupby(['id', 'name'], as_index=False).first()

Or:

df = df.groupby(['id', 'name'], as_index=False).last()

Or:

df = df.groupby(['id', 'name'], as_index=False).mean()

Or:

df = df.groupby(['id', 'name'], as_index=False).sum(min_count=1)
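For reference, a minimal runnable sketch of the first() variant against the question's data. GroupBy.first returns the first non-NaN value in each column per group, so it picks up the value no matter which row of the group holds it, and leaves NaN where the whole group is NaN (here, John's weight):

```python
import numpy as np
import pandas as pd

# Sample frame from the question: each (id, name) pair has at most
# one non-NaN value per measurement column.
data = {
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"],
    "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan],
    "weight": [np.nan, np.nan, np.nan, 72, np.nan, 68, np.nan, 70],
}
df = pd.DataFrame(data)

# first() skips NaN by default, so each group collapses to its
# non-NaN values; an all-NaN column in a group stays NaN.
squashed = df.groupby(["id", "name"], as_index=False).first()
print(squashed)
```

The sum(min_count=1) variant behaves the same way here: with at most one non-NaN value per group, the sum is just that value, and min_count=1 keeps an all-NaN group as NaN instead of collapsing it to 0.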
