
Suppose I have a dataframe with rows containing missing data, but a set of columns acting as a key:

import pandas as pd
import numpy as np
data = {
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"],
    "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan],
    "weight": [np.nan, np.nan, np.nan, 72, np.nan, 68, np.nan, 70],
}

df = pd.DataFrame.from_dict(data)
print(df)


   id    name  height  weight
0   1    John   178.0     NaN
1   1    John     NaN     NaN
2   2    Paul   182.0     NaN
3   2    Paul     NaN    72.0
4   3   Ringo   175.0     NaN
5   3   Ringo     NaN    68.0
6   4  George   188.0     NaN
7   4  George     NaN    70.0

How would I go about "squashing" these rows with duplicate keys down to pick the non-nan value (if it exists)?

Desired output:

   id    name  height  weight
0   1    John   178.0     NaN
2   2    Paul   182.0    72.0
4   3   Ringo   175.0    68.0
6   4  George   188.0    70.0

The index doesn't matter, and each group has at most one row with a non-NaN value in each column. I think I need to use groupby(['id', 'name']), but I'm not sure where to go from there.

1 Answer


If there is always at most one non-NaN value per group and column, it is possible to aggregate in several ways:

df = df.groupby(['id', 'name'], as_index=False).first()

Or:

df = df.groupby(['id', 'name'], as_index=False).last()

Or:

df = df.groupby(['id', 'name'], as_index=False).mean()

Or:

df = df.groupby(['id', 'name'], as_index=False).sum(min_count=1)
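For reference, a minimal runnable sketch of the first() variant against the question's data. GroupBy.first returns the first non-NaN value in each column per group, so it picks up the value no matter which row of the group holds it, and leaves NaN where the whole group is NaN (here, John's weight):

```python
import numpy as np
import pandas as pd

# Sample frame from the question: each (id, name) pair has at most
# one non-NaN value per measurement column.
data = {
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"],
    "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan],
    "weight": [np.nan, np.nan, np.nan, 72, np.nan, 68, np.nan, 70],
}
df = pd.DataFrame(data)

# first() skips NaN by default, so each group collapses to its
# non-NaN values; an all-NaN column in a group stays NaN.
squashed = df.groupby(["id", "name"], as_index=False).first()
print(squashed)
```

The sum(min_count=1) variant behaves the same way here: with at most one non-NaN value per group, the sum is just that value, and min_count=1 keeps an all-NaN group as NaN instead of collapsing it to 0.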
