Pandas | merge rows with same id

Question

Here is the example data set

id         firstname    lastname      email               update date
A1         wendy         smith         [email protected]        2018-01-02
A1         wendy         smith         [email protected]     2019-02-03 
A2         harry         lynn          [email protected]      2016-04-03
A2         harry                       [email protected]  2019-03-12
A3         tinna         dickey        [email protected]     2016-04-03
A3         tinna         dickey        [email protected]     2013-06-12
A4         Tom           Lee           [email protected]       2012-06-12
A5         Ella                        [email protected]      2019-07-12
A6         Ben           Lang          [email protected]       2019-03-12

I have sorted the data set by id and update date. I want to merge the rows that share an id: if one row has an empty value, fill it from the other row with the same id; if the values conflict, use the latest one. For rows with no duplicate id, leave the empty cells as they are.

The output should be:

id         firstname    lastname      email               update date
A1         wendy         smith         [email protected]     2019-02-03 
A2         harry         lynn          [email protected]  2019-03-12
A3         tinna         dickey        [email protected]     2019-03-12
A4         Tom           Lee           [email protected]       2012-06-12
A5         Ella                        [email protected]      2019-07-12
A6         Ben           Lang          [email protected]       2019-03-12

My attempt used ffill() to fill the empty cells and then kept the last duplicate, but the fill also changes cells that should stay empty (for example, lastname in A5 should be empty).

df = df.ffill().drop_duplicates('id', keep='last')
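To see why the attempt above leaks values, here is a minimal sketch on three of the rows. The DataFrame is rebuilt by hand, so treat the construction as an assumption about how the data is stored (empty cells as NaN).

```python
import numpy as np
import pandas as pd

# Three rows from the question; empty cells are assumed to be NaN.
df = pd.DataFrame({
    "id": ["A4", "A5", "A6"],
    "firstname": ["Tom", "Ella", "Ben"],
    "lastname": ["Lee", np.nan, "Lang"],
})

# A plain ffill ignores id boundaries, so A5 inherits "Lee" from A4.
leaked = df.ffill().drop_duplicates("id", keep="last")
print(leaked)
```

The fill has to be restricted to rows that share an id, which is what the answers below do.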

Erfan · Accepted Answer · 2019-10-03 12:05:16Z

4

Use GroupBy.ffill to forward fill only within each id group. Then use drop_duplicates:

df['lastname'] = df.groupby('id')['lastname'].ffill()
df = df.drop_duplicates('id', keep='last')

Or in one line (but less readable in my opinion), using assign:

df.assign(lastname=df.groupby('id')['lastname'].ffill()).drop_duplicates('id', keep='last')

Output

   id firstname lastname              email update date
1  A1     wendy    smith     [email protected]  2019-02-03
3  A2     harry     lynn  [email protected]  2019-03-12
5  A3     tinna   dickey     [email protected]  2013-06-12
6  A4       Tom      Lee       [email protected]  2012-06-12
7  A5      Ella      NaN      [email protected]  2019-07-12
8  A6       Ben     Lang       [email protected]  2019-03-12
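For anyone who wants to run this end to end, a self-contained sketch of the same two steps on a cut-down copy of the data (rebuilt by hand, with empty cells assumed to be NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A2", "A2", "A5"],
    "firstname": ["harry", "harry", "Ella"],
    "lastname": ["lynn", np.nan, np.nan],
    "update date": ["2016-04-03", "2019-03-12", "2019-07-12"],
})

# Forward fill lastname within each id only, then keep the last row per id.
df["lastname"] = df.groupby("id")["lastname"].ffill()
merged = df.drop_duplicates("id", keep="last")
print(merged)
```

A2 picks up "lynn" from its earlier row, while A5 stays NaN because it has no other row to fill from.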

answered Oct 3, 2019 at 12:05

Erfan


1 Comment

user2748930 Over a year ago

if other columns contain empty values that need to be grouped, can I add more lines like df['email'] = df.groupby('id')['email'].ffill() ... before drop duplicates? Thanks
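On the question in this comment: one way that should work is to pass a list of columns to the groupby selection, so that all of them are filled in one step. A minimal sketch, assuming the column names from the question; the e-mail addresses are invented placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder rows; the e-mail addresses are invented for illustration.
df = pd.DataFrame({
    "id": ["A1", "A1", "A5"],
    "lastname": ["smith", np.nan, np.nan],
    "email": ["old@example.com", np.nan, "ella@example.com"],
})

cols = ["lastname", "email"]  # every column that may need filling
df[cols] = df.groupby("id")[cols].ffill()
merged = df.drop_duplicates("id", keep="last")
print(merged)
```

Because ffill only touches NaN cells, a non-empty value in the latest row is kept as is, so conflicts are still resolved in favour of the last row.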

bharatk · Accepted Answer · 2019-10-03 12:35:53Z

2

Use

DataFrame.groupby - Group DataFrame or Series using a mapper or by a Series of columns.
GroupBy.last - Compute last of group values.
DataFrame.replace - Replace values given in to_replace with value.

Ex.

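import numpy as np

# Turn empty or whitespace-only strings into NaN so that last() can skip them
df = df.replace(r'^\s*$', np.nan, regex=True)
df1 = df.groupby('id', as_index=False, sort=False).last()
print(df1)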

   id firstname lastname              email  updatedate
0  A1     wendy    smith     [email protected]  2019-02-03
1  A2     harry     lynn  [email protected]  2019-03-12
2  A3     tinna   dickey     [email protected]  2013-06-12
3  A4       Tom      Lee       [email protected]  2012-06-12
4  A5      Ella      NaN      [email protected]  2019-07-12
5  A6       Ben     Lang       [email protected]  2019-03-12
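This works because GroupBy.last skips missing values by default, so each column gets the last non-null entry of its group. A small sketch that shows this, assuming the empty cells arrive as empty strings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A2", "A2", "A5"],
    "lastname": ["lynn", "", ""],
    "update date": ["2016-04-03", "2019-03-12", "2019-07-12"],
})

# Empty or whitespace-only strings become NaN so that last() can skip them.
df = df.replace(r"^\s*$", np.nan, regex=True)

# Per column, last() returns the last non-null value within each id.
df1 = df.groupby("id", as_index=False, sort=False).last()
print(df1)
```

A2 keeps "lynn" together with the 2019 date, and A5 stays NaN since its group has no non-null lastname at all.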

edited Oct 3, 2019 at 12:35

answered Oct 3, 2019 at 11:45

bharatk


Divya Dass · Accepted Answer · 2019-10-03 12:16:55Z

1

Try this:

df.groupby('id').ffill().drop_duplicates('id', keep='last')

output:

   id firstname lastname              email  update date
1  A1     wendy    smith     [email protected]  2019-02-03 
3  A2     harry     lynn  [email protected]   2019-03-12
5  A3     tinna   dickey     [email protected]   2013-06-12
6  A4       Tom      Lee       [email protected]   2012-06-12
7  A5      Ella      NaN      [email protected]   2019-07-12
8  A6       Ben     Lang       [email protected]   2019-03-12

edited Oct 3, 2019 at 12:16

answered Oct 3, 2019 at 12:10

Divya Dass


3 Comments

Divya Dass Over a year ago

Because of ffill(), id won't be set as the index.

user2748930 Over a year ago

@Divya Dass When I tried your code, I got the error KeyError: Index(['id'], dtype='object'). Do you know what might cause it? Thanks

Divya Dass Over a year ago

@user2748930 Not sure why you are getting this error. Erfan also stated to have got this error earlier. This error means that id has become an index in your case. In my case it had become a column after I had used ffill(). Therefore, I used drop_duplicates() after this. In your case df.groupby('id').ffill().reset_index().drop_duplicates('id', keep='last') might help. This will make id a column if it has become index in your code.
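A possible explanation for the KeyError: depending on the pandas version, the result of groupby('id').ffill() may not contain the id column at all, while it keeps the original row index. The sketch below is one way to make the one-liner robust under that assumption, by putting id back before dropping duplicates; it is a workaround, not the original answer's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A2", "A2", "A5"],
    "firstname": ["harry", "harry", "Ella"],
    "lastname": ["lynn", np.nan, np.nan],
})

filled = df.groupby("id").ffill()

# Some pandas versions leave the grouping column out of the result.
# The row index is unchanged, so id can be re-attached by index alignment.
if "id" not in filled.columns:
    filled.insert(0, "id", df["id"])

merged = filled.drop_duplicates("id", keep="last")
print(merged)
```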

adrianp · Accepted Answer · 2019-10-03 12:20:17Z

0

Use a combination of groupby, apply, and iloc:

df.groupby('id', as_index=False).apply(lambda x: x.ffill().iloc[-1])

   id firstname lastname              email  update date
0  A1     wendy    smith     [email protected]  2019-02-03
1  A2     harry     lynn  [email protected]  2019-03-12
2  A3     tinna   dickey     [email protected]  2013-06-12
3  A4       Tom      Lee       [email protected]  2012-06-12
4  A5      Ella      NaN      [email protected]  2019-07-12
5  A6       Ben     Lang       [email protected]  2019-03-12

groupby groups the dataframe by id
ffill fills each NaN with the nearest non-NaN value above it in the same group
iloc[-1] takes the last row of each group, which is the latest one when the data is sorted by update date
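Taking the last row only gives the latest data if the rows really are ordered by date, and in the sample the two A3 rows are not. A hedged sketch that sorts first, using GroupBy.last in place of apply (a swapped-in technique, not the code above):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A3", "A3", "A4"],
    "firstname": ["tinna", "tinna", "Tom"],
    "update date": ["2016-04-03", "2013-06-12", "2012-06-12"],
})

# Parse the dates so that sorting is chronological rather than textual.
df["update date"] = pd.to_datetime(df["update date"])

latest = (
    df.sort_values(["id", "update date"])
      .groupby("id", as_index=False)
      .last()
)
print(latest)
```

With the sort in place, A3 ends up with the 2016-04-03 row rather than whichever row happened to come last in the file.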

edited Oct 3, 2019 at 12:20

answered Oct 3, 2019 at 12:14

adrianp

1,0191 gold badge9 silver badges14 bronze badges

Collectives™ on Stack Overflow

Pandas | merge rows with same id

4 Answers 4

1 Comment

Comments

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

Comments

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related