Python Pandas - Merging two Dataframes with new and old rows

Question

I have two DataFrames whose rows share an identical (corresponding) index, and I want to merge them. Every row has an update time. For rows with identical indexes, the row with the higher update time wins. All fields from the ‘newer’ row should be taken, except for fields that have a value only in the ‘older’ row. Example:

df1 = pd.DataFrame({'Hugo' : {'age' : 21, 'weight' : 75},
                   'Niklas': {'age' : 46, 'weight' : 65},
                   'Ronald' : {'age' : 76, 'weight' : 85, 'height' : 176}}).T
df1.index.names = ['name']
df1['update_time'] = 1

df2 = pd.DataFrame({'Hugo' : {'age' : 22, 'weight' : 77},
                   'Bertram': {'age' : 45, 'weight' : 65, 'height' : 190},
                   'Donald' : {'age' : 75, 'weight' : 85},
                   'Ronald' : {'age' : 77, 'weight' : 84}}).T
df2.index.names = ['name']
df2['update_time'] = 2


df1:
+--------+-------+----------+----------+---------------+
| name   |   age |   height |   weight |   update_time |
|--------+-------+----------+----------+---------------|
| Hugo   |    21 |      nan |       75 |             1 |
| Niklas |    46 |      nan |       65 |             1 |
| Ronald |    76 |      176 |       85 |             1 |
+--------+-------+----------+----------+---------------+
df2:
+---------+-------+----------+---------------+
| name    |   age |   weight |   update_time |
|---------+-------+----------+---------------|
| Bertram |    45 |       65 |             2 |
| Donald  |    75 |       85 |             2 |
| Hugo    |    22 |       77 |             2 |
| Ronald  |    77 |       84 |             2 |
+---------+-------+----------+---------------+

Result should look like this:

+---------+-------+----------+----------+---------------+
| name    |   age |   height |   weight |   update_time |
|---------+-------+----------+----------+---------------|
| Niklas  |    46 |      nan |       65 |             1 |
| Bertram |    45 |      190 |       65 |             2 |
| Donald  |    75 |      nan |       85 |             2 |
| Hugo    |    22 |      nan |       77 |             2 |
| Ronald  |    77 |      176 |       84 |             2 |
+---------+-------+----------+----------+---------------+

How could I do this? The problem is keeping Ronald's height field. If I first do a df.update on df1, the timestamp is no longer there and I cannot find the older duplicates. If I do a df.append, I can't merge the fields.

First, your input does not match your output. Second, you should search before asking.
– BENY, Commented Apr 9, 2018 at 18:08
When I use combine_first (which I didn't know about, and I searched a lot before asking), I get exactly the output above from the input above.
– Egirus Ornila, Commented Apr 9, 2018 at 18:54

Scott Boston · Accepted Answer · 2018-04-09 18:04:52Z

6

Use combine_first:

df2.combine_first(df1)

output:

          age  height  weight  update_time
name                                      
Bertram  45.0   190.0    65.0          2.0
Donald   75.0     NaN    85.0          2.0
Hugo     22.0     NaN    77.0          2.0
Niklas   46.0     NaN    65.0          1.0
Ronald   77.0   176.0    84.0          2.0
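To check this against the question's data, here is a minimal, self-contained sketch. The frames are copied from the question; the variable name `merged` is only illustrative. It assumes, as in the example, that every row of df2 is newer than the matching row of df1, because combine_first gives the calling frame priority without looking at update_time.

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({'Hugo': {'age': 21, 'weight': 75},
                    'Niklas': {'age': 46, 'weight': 65},
                    'Ronald': {'age': 76, 'weight': 85, 'height': 176}}).T
df1.index.names = ['name']
df1['update_time'] = 1

df2 = pd.DataFrame({'Hugo': {'age': 22, 'weight': 77},
                    'Bertram': {'age': 45, 'weight': 65, 'height': 190},
                    'Donald': {'age': 75, 'weight': 85},
                    'Ronald': {'age': 77, 'weight': 84}}).T
df2.index.names = ['name']
df2['update_time'] = 2

# Values from df2 take priority. df1 only fills cells that are NaN in df2
# and contributes rows whose index label is missing from df2.
merged = df2.combine_first(df1)

# Ronald keeps the height from the older row but takes the newer age.
print(merged.loc['Ronald'])
```

If the newer rows are spread across both frames, this shortcut no longer matches the "higher update time wins" rule, and the rows would have to be ordered by update_time first.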


1 Comment

double0darbo Over a year ago

This is what I have been looking for for some time. My solution was far clunkier and I'm sure it was going to cause some trouble.

double0darbo · Answer · 2022-09-30 17:00:52Z

0

I am adding my answer here for completeness, as another working solution that I have mainly used:

# Group rows by index label and keep the last non-null value of each column
rdpz = lambda df: df.groupby(df.index).last()
df3 = rdpz(pd.concat([df1, df2]))

Here, the function rdpz groups the rows of the concatenated dataframes by their index and, for each column, chooses the last non-null entry (first can be substituted for last if the list order is reversed).

As I said in my comment, this is not nearly as clean as Scott Boston's answer, but it is helpful if you want to extend to multiple dataframes or have trickier MultiIndexes. In one line, you can use:

df3 = pd.concat([df1, df2]).apply(lambda df: df.groupby(df.index).last())
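Both versions above rely on the order of the list passed to pd.concat. As a hedged sketch of how this could extend to several dataframes whose newest rows are not all in the last one, the rows can be sorted by update_time before grouping. The helper name `merge_latest`, its `time_col` parameter, and the small `old`/`new` frames are hypothetical and not part of the original answer.

```python
import pandas as pd

def merge_latest(frames, time_col="update_time"):
    """Combine frames sharing an index: newer rows win, older values fill gaps."""
    combined = pd.concat(frames)
    # Stable sort so that, within each index label, rows run oldest to newest.
    combined = combined.sort_values(time_col, kind="mergesort")
    # GroupBy.last() returns the last non-null value of each column per group.
    return combined.groupby(level=0).last()

# Hypothetical data: the newest row for "a" sits in the first frame.
old = pd.DataFrame({"x": [1.0, 2.0], "y": [10.0, None], "update_time": [3, 1]},
                   index=["a", "b"])
new = pd.DataFrame({"x": [5.0, None], "y": [None, 20.0], "update_time": [2, 2]},
                   index=["a", "b"])

result = merge_latest([old, new])
print(result)
```

Here "a" takes its values from `old` (update_time 3), while "b" takes y from `new` and keeps x from `old`, because x is missing in the newer row.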
