How to merge two pandas dataframe with left join when both dataframe contains duplicate keys?

Question

I have two Python Pandas Dataframe like below:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K0', 'K0', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
                      'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                       'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
print left
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K1   K0
4  A4  B4   K1   K0
5  A5  B5   K1   K0
6  A6  B6   K2   K1

print right
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0

I want to merge this two Dataframe on certain condition:

I want to merge this dataframes on key1 and key2.
I want output Dataframe with only rows available in the left dataframe. (If the key1 and key2 of left Dataframe is available in the key1 and key2 of the right Dataframe then fulfill the current row by appending column C and D)

I tried like below:

result = pd.merge(left, right, on=['key1', 'key2'],how = 'left')
print result

# Output - which is not expected for me
     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
3   A2  B2   K1   K0   C2   D2
4   A3  B3   K1   K0   C1   D1
5   A3  B3   K1   K0   C2   D2
6   A4  B4   K1   K0   C1   D1
7   A4  B4   K1   K0   C2   D2
8   A5  B5   K1   K0   C1   D1
9   A5  B5   K1   K0   C2   D2
10  A6  B6   K2   K1  NaN  NaN

If I drop the duplicate entries from the left Dataframe then also I didn't get the expected output.

Then, I tried to remove duplicate entries from the result Dataframe then I got below Dataframe:

result.drop_duplicates(subset = ['A','B','key1','key2'], inplace=True)
print result

     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
4   A3  B3   K1   K0   C1   D1
6   A4  B4   K1   K0   C1   D1
8   A5  B5   K1   K0   C1   D1
10  A6  B6   K2   K1  NaN  NaN

The problem is column C and D contains same values which is because of default keep=First behavior. I want to change this behavior so that I can get below expected Dataframe.

Edit :

If the key1 and key2 of left Dataframe match in the key1 and key2 of right Dataframe then append C and D columns to that row and If column C and D are already appended to the same row then keep C and D as Nan.

Expected Output:

     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
3   A3  B3   K1   K0   C2   D2
4   A4  B4   K1   K0   NaN  Nan
5   A5  B5   K1   K0   NaN  Nan
6   A6  B6   K2   K1  NaN  NaN

df1 has four rows with K1 and K0. df2 has two such rows. Your expected results consist of four such rows that match df1 where we alternate through available rows from df1. What happens if there were 3 such rows in df1 and two such rows in df2 — piRSquared
– piRSquared, Commented Oct 27, 2017 at 9:12
@piRSquared- Actually column C and D is important for me so I want to take all entries from column C and D which are unique for particular match. In that case (3 rows in df1 case), first 2 rows should take C1 and D1 and third row take C2 and D2. — ketan
– ketan, Commented Oct 27, 2017 at 9:16

HerrIvan · Accepted Answer · 2018-06-24 18:14:24Z

2

Unfortunately, your question does not describe what you want to achieve in a way that it could become useful to anyone with a similar problem.

Indeed, you wanted to obtain a sorted merge for repeated merging keys.

The logical way to proceed is add a sequence number to make the merging multiple key unique. Then what follows is a trivial merge.

left['Order'] = left.groupby(['key1','key2']).cumcount()
right['Order'] = right.groupby(['key1','key2']).cumcount()

result = left.merge(right, how='left', 
                    on=['key1','key2','Order']).drop('Order',axis=1)

answered Jun 24, 2018 at 18:14

HerrIvan

6595 silver badges18 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

agfe2 Over a year ago

This is the exactly what I want. groupby and cumcount is the key!

Collectives™ on Stack Overflow

How to merge two pandas dataframe with left join when both dataframe contains duplicate keys?

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related