Pandas Merge DataFrames without rows overlap

Question

I have two dataframes like these:

They have the same columns.

Since I am broadcasting an API, they usually have some overlap, which can be handled by the tradeID, which is unique.

I have tried some stuff like:

df2 = df0.join(df1, how='outer', lsuffix='_caller', rsuffix='_other')

and

df2 = df0.merge(df1, left_index=True, right_index=True)

But the results are respectively:

and

I am looking for a union without overlap, could someone help me?

So when a tradeID is present in both data frames, what do you expect to appear in the merged result?
– Igor Raush, Commented Jun 1, 2017 at 23:26
@IgorRaush, both rows would be exactly the same, and I would like to keep just one of them. Please also note that tradeID is an index.
– Thiago Melo, Commented Jun 1, 2017 at 23:29
The code df2 = df0.merge(df1, how='outer') works, but it throws my indexes away.
– Thiago Melo, Commented Jun 1, 2017 at 23:36

elPastor · Accepted Answer · 2017-06-02 00:55:38Z

6

Seems like combine_first() should do it for you:

df2 = df0.combine_first(df1)

...where df0 takes precedence over df1 when the indices match. In your case the overlapping rows are identical, so it doesn't really matter which one wins. But if they were not identical, df0's values would be the ones kept.

The following is an example of it working with dummy data.

Code:

import pandas as pd
import io

a = io.StringIO(u'''
tradeID,amount,date
X001,100,1/1/2016
X002,200,1/2/2016
X003,300,1/3/2016
X005,500,1/5/2016
''')

b = io.StringIO(u'''
tradeID,amount,date
X004,400,1/4/2016
X005,500,1/5/2016
X006,600,1/6/2016
''')

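# Use tradeID as the index so the two frames are aligned on it
dfA = pd.read_csv(a, index_col='tradeID')
dfB = pd.read_csv(b, index_col='tradeID')

# Union of both indexes; dfA's values win where a tradeID appears in both
df = dfA.combine_first(dfB)
print(df)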

Output:

         amount      date
tradeID                  
X001      100.0  1/1/2016
X002      200.0  1/2/2016
X003      300.0  1/3/2016
X004      400.0  1/4/2016
X005      500.0  1/5/2016
X006      600.0  1/6/2016
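
To make the precedence rule concrete, here is a minimal sketch with overlapping rows that are deliberately not identical. The frames and values are made up for illustration and are not the asker's data. It also shows that combine_first() works cell by cell: a missing value in the calling frame is filled from the other frame.

```python
import pandas as pd

# Hypothetical data: X005 appears in both frames with different values,
# and df0 is missing the date for X005.
df0 = pd.DataFrame(
    {"amount": [100.0, 500.0], "date": ["1/1/2016", None]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
df1 = pd.DataFrame(
    {"amount": [999.0, 600.0], "date": ["1/5/2016", "1/6/2016"]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

combined = df0.combine_first(df1)

# df0's non-null amount (500.0) is kept for X005,
# while its missing date is filled in from df1.
print(combined)
```

If the overlapping rows are guaranteed to be identical, as in the question, this cell-level filling makes no difference.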

If you really want to use merge you can still do that, but you'll need to add some syntax to keep your indices:

df = dfA.reset_index().merge(dfB.reset_index(), how = 'outer').set_index('tradeID')
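
As a rough illustration of why the reset_index() / set_index() round trip is needed, the sketch below compares a plain outer merge with the index-preserving version. The small frames are hypothetical stand-ins for the dummy data above.

```python
import pandas as pd

# Hypothetical frames indexed by tradeID, with one overlapping row (X005)
dfA = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
dfB = pd.DataFrame(
    {"amount": [500, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# A plain outer merge joins on the shared columns only and returns
# a fresh integer index, so the tradeID labels are lost.
plain = dfA.merge(dfB, how="outer")

# Moving tradeID into a column first makes it part of the join keys;
# set_index() then restores it as the index.
kept = dfA.reset_index().merge(dfB.reset_index(), how="outer").set_index("tradeID")

print(plain)
print(kept)
```

Note that with this approach the merge keys are tradeID plus every other shared column, so two rows with the same tradeID but different values would both survive in the result.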

I ran super rudimentary timing on these two options and combine_first() consistently beat merge by nearly 3x on this very small data set.

...and Igor Raush's version tested at or slightly faster than combine_first().
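
For anyone who wants to repeat the comparison, here is one way such a rudimentary timing could be set up with the standard timeit module. This is a sketch, not the script used for the numbers above. The frames are hypothetical, and results will vary with data size, pandas version, and machine.

```python
import timeit

import pandas as pd

# Hypothetical small frames, similar in shape to the dummy data above
dfA = pd.DataFrame(
    {"amount": [100, 200, 300, 500]},
    index=pd.Index(["X001", "X002", "X003", "X005"], name="tradeID"),
)
dfB = pd.DataFrame(
    {"amount": [400, 500, 600]},
    index=pd.Index(["X004", "X005", "X006"], name="tradeID"),
)

# The three approaches discussed in this thread
candidates = {
    "combine_first": lambda: dfA.combine_first(dfB),
    "merge": lambda: dfA.reset_index()
    .merge(dfB.reset_index(), how="outer")
    .set_index("tradeID"),
    "concat": lambda: pd.concat([dfA, dfB]).loc[lambda df: ~df.index.duplicated()],
}

# Best of 3 runs, 100 calls per run
timings = {
    name: min(timeit.repeat(fn, number=100, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s per 100 calls")
```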

edited Jun 2, 2017 at 0:55

answered Jun 1, 2017 at 23:44

elPastor


3 Comments

Thiago Melo Over a year ago

awesome! it worked exactly like i needed! thank you very much!

elPastor Over a year ago

Glad I could help

Stevoisiak May 15 at 20:53

Does this check for uniqueness across all fields or just the ID?

Igor Raush · Accepted Answer · 2017-06-01 23:47:46Z

1

One way to accomplish this is

pd.concat([df0, df1]).loc[lambda df: ~df.index.duplicated()]
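
A small worked example of this one-liner, using made-up frames rather than the asker's data, may help show what it keeps. It assumes tradeID is the index, as in the question. Only the index labels are checked for duplicates, not the other columns, and the first occurrence (here, the row from df0) is the one retained.

```python
import pandas as pd

# Hypothetical data: X005 is in both frames, with a different amount in df1
df0 = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
df1 = pd.DataFrame(
    {"amount": [999, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# Stack the frames, then drop every row whose index label was already seen.
# duplicated() marks later repeats by default (keep="first").
stacked = pd.concat([df0, df1])
result = stacked.loc[~stacked.index.duplicated(keep="first")]

print(result)
```

The result keeps concatenation order rather than sorted order, so a trailing .sort_index() can be added if sorted tradeIDs are wanted.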

answered Jun 1, 2017 at 23:47

Igor Raush
