2

I have two dataframes like these:

enter image description here

enter image description here

They have the same columns.

Since I am broadcasting an API, they usually have some overlap, which can be resolved using the tradeID, which is unique.

I have tried some stuff like:

df2 = df0.join(df1, how='outer', lsuffix='_caller', rsuffix='_other')

and

df2 = df0.merge(df1, left_index=True, right_index=True)

But the results are respectively:

enter image description here

and

enter image description here

I am looking for a union without overlap. Could someone help me?

3
  • So when a tradeID is present in both data frames, what do you expect to appear in the merged result? Commented Jun 1, 2017 at 23:26
  • @IgorRaush, both rows would be exactly the same, and I would like to keep just one of them. Please also note that tradeID is the index. Commented Jun 1, 2017 at 23:29
  • The code df2 = df0.merge(df1, how='outer') works, but it throws my indexes away. Commented Jun 1, 2017 at 23:36

2 Answers

6

Seems like combine_first() should do it for you:

df2 = df0.combine_first(df1)

...where df0 takes precedence over df1 when the indices match. In your case the overlapping rows are identical, so it doesn't really matter which one wins. If they were not identical, the non-null values from df0 would be kept and df1 would only fill in the gaps; that's how combine_first() works.

The following is an example of it working with dummy data.

Code:

import pandas as pd
import io

a = io.StringIO(u'''
tradeID,amount,date
X001,100,1/1/2016
X002,200,1/2/2016
X003,300,1/3/2016
X005,500,1/5/2016
''')

b = io.StringIO(u'''
tradeID,amount,date
X004,400,1/4/2016
X005,500,1/5/2016
X006,600,1/6/2016
''')

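# Use tradeID as the index so the two frames are aligned on it
dfA = pd.read_csv(a, index_col='tradeID')
dfB = pd.read_csv(b, index_col='tradeID')

# dfA wins wherever both frames have a row for the same tradeID
df = dfA.combine_first(dfB)
print(df)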

Output:

         amount      date
tradeID                  
X001      100.0  1/1/2016
X002      200.0  1/2/2016
X003      300.0  1/3/2016
X004      400.0  1/4/2016
X005      500.0  1/5/2016
X006      600.0  1/6/2016
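To make the precedence rule concrete, here is a minimal sketch in which the two frames disagree on an overlapping tradeID. The frames and the `fee` column are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data: both frames contain X005, but with different values
left = pd.DataFrame(
    {"amount": [100, 500], "fee": [1.0, float("nan")]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
right = pd.DataFrame(
    {"amount": [999, 600], "fee": [2.5, 3.0]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

combined = left.combine_first(right)

# X005 keeps amount from `left` (500), because `left` has a non-null value there.
# X005 takes fee from `right` (2.5), because `left` has NaN in that cell.
# X006 exists only in `right`, so the whole row comes from `right`.
print(combined)
```

So the check is per cell rather than per row: a row from the caller can still pick up individual values from the other frame wherever the caller holds NaN.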

If you really want to use merge you can still do that, but you'll need to add some syntax to keep your indices:

df = dfA.reset_index().merge(dfB.reset_index(), how = 'outer').set_index('tradeID')
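One thing to watch with this version: when no `on` argument is given, merge joins on every column the two frames share, not just tradeID. The sketch below uses made-up frames to show that identical rows collapse into one, while rows that share a tradeID but differ elsewhere are both kept.

```python
import pandas as pd

# Hypothetical data, not from the question
a = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
b = pd.DataFrame(
    {"amount": [500, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# Overlapping row X005 is identical in both frames, so it appears once
same = a.reset_index().merge(b.reset_index(), how="outer").set_index("tradeID")

# Now make X005 disagree between the two frames
b_conflict = b.copy()
b_conflict.loc["X005", "amount"] = 999

# The join key is (tradeID, amount), so both versions of X005 survive
conflict = (
    a.reset_index()
    .merge(b_conflict.reset_index(), how="outer")
    .set_index("tradeID")
)

print(len(same), same.index.is_unique)
print(len(conflict), conflict.index.is_unique)
```

If the overlapping rows are guaranteed to be identical, as in the question, this difference never shows up. If they might differ, the resulting index is no longer unique.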

I ran super rudimentary timing on these two options and combine_first() consistently beat merge by nearly 3x on this very small data set.

...and Igor Raush's version tested at or slightly faster than combine_first().
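If you want to repeat the comparison yourself, something like the following would do. It is only a rough sketch using `timeit` on frames equivalent to the dummy data above; the repeat count is arbitrary, and the numbers will vary with machine, pandas version and data size, so don't read too much into them.

```python
import timeit

import pandas as pd

# Same shape as the dummy data above, built directly instead of via read_csv
dfA = pd.DataFrame(
    {"amount": [100, 200, 300, 500],
     "date": ["1/1/2016", "1/2/2016", "1/3/2016", "1/5/2016"]},
    index=pd.Index(["X001", "X002", "X003", "X005"], name="tradeID"),
)
dfB = pd.DataFrame(
    {"amount": [400, 500, 600],
     "date": ["1/4/2016", "1/5/2016", "1/6/2016"]},
    index=pd.Index(["X004", "X005", "X006"], name="tradeID"),
)

candidates = {
    "combine_first": lambda: dfA.combine_first(dfB),
    "merge": lambda: dfA.reset_index()
                        .merge(dfB.reset_index(), how="outer")
                        .set_index("tradeID"),
    "concat": lambda: pd.concat([dfA, dfB]).loc[lambda df: ~df.index.duplicated()],
}

REPEATS = 200  # arbitrary; raise it for steadier numbers
timings = {name: timeit.timeit(fn, number=REPEATS) for name, fn in candidates.items()}

# Print fastest first
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: {seconds / REPEATS * 1e3:.3f} ms per call")
```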


3 Comments

awesome! it worked exactly like i needed! thank you very much!
Glad I could help
Does this check for uniqueness across all fields or just the ID?
1

One way to accomplish this is

pd.concat([df0, df1]).loc[lambda df: ~df.index.duplicated()]
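Here is the same idea spelled out on made-up data, as a sketch. Note that `index.duplicated()` looks only at the index (tradeID), not at the other columns, and by default it marks every occurrence after the first, so rows from df0 win over rows from df1.

```python
import pandas as pd

# Hypothetical data: X005 appears in both frames with different amounts
df0 = pd.DataFrame(
    {"amount": [100, 200, 500]},
    index=pd.Index(["X001", "X002", "X005"], name="tradeID"),
)
df1 = pd.DataFrame(
    {"amount": [999, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# Stack the frames, then drop every row whose tradeID was already seen
stacked = pd.concat([df0, df1])
union = stacked.loc[~stacked.index.duplicated(keep="first")]

print(union)
```

Passing `keep="last"` instead would let df1 win on overlapping IDs. Unlike combine_first(), this approach keeps the rows in concatenation order and does not introduce NaN, so integer columns stay integers.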
