1

I have two Python Pandas Dataframe like below:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K0', 'K0', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
                      'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                       'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
print left
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K1   K0
4  A4  B4   K1   K0
5  A5  B5   K1   K0
6  A6  B6   K2   K1

print right
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0

I want to merge this two Dataframe on certain condition:

  1. I want to merge this dataframes on key1 and key2.
  2. I want output Dataframe with only rows available in the left dataframe. (If the key1 and key2 of left Dataframe is available in the key1 and key2 of the right Dataframe then fulfill the current row by appending column C and D)

I tried like below:

result = pd.merge(left, right, on=['key1', 'key2'],how = 'left')
print result

# Output - which is not expected for me
     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
3   A2  B2   K1   K0   C2   D2
4   A3  B3   K1   K0   C1   D1
5   A3  B3   K1   K0   C2   D2
6   A4  B4   K1   K0   C1   D1
7   A4  B4   K1   K0   C2   D2
8   A5  B5   K1   K0   C1   D1
9   A5  B5   K1   K0   C2   D2
10  A6  B6   K2   K1  NaN  NaN

If I drop the duplicate entries from the left Dataframe then also I didn't get the expected output.

Then, I tried to remove duplicate entries from the result Dataframe then I got below Dataframe:

result.drop_duplicates(subset = ['A','B','key1','key2'], inplace=True)
print result

     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
4   A3  B3   K1   K0   C1   D1
6   A4  B4   K1   K0   C1   D1
8   A5  B5   K1   K0   C1   D1
10  A6  B6   K2   K1  NaN  NaN

The problem is column C and D contains same values which is because of default keep=First behavior. I want to change this behavior so that I can get below expected Dataframe.

Edit :

If the key1 and key2 of left Dataframe match in the key1 and key2 of right Dataframe then append C and D columns to that row and If column C and D are already appended to the same row then keep C and D as Nan.

Expected Output:

     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
3   A3  B3   K1   K0   C2   D2
4   A4  B4   K1   K0   NaN  Nan
5   A5  B5   K1   K0   NaN  Nan
6   A6  B6   K2   K1  NaN  NaN
3
  • df1 has four rows with K1 and K0. df2 has two such rows. Your expected results consist of four such rows that match df1 where we alternate through available rows from df1. What happens if there were 3 such rows in df1 and two such rows in df2 Commented Oct 27, 2017 at 9:12
  • @piRSquared- Actually column C and D is important for me so I want to take all entries from column C and D which are unique for particular match. In that case (3 rows in df1 case), first 2 rows should take C1 and D1 and third row take C2 and D2. Commented Oct 27, 2017 at 9:16
  • @piRSquared- Please see edit. Can we do like this? Commented Oct 27, 2017 at 10:42

1 Answer 1

2

Unfortunately, your question does not describe what you want to achieve in a way that it could become useful to anyone with a similar problem.

Indeed, you wanted to obtain a sorted merge for repeated merging keys.

The logical way to proceed is add a sequence number to make the merging multiple key unique. Then what follows is a trivial merge.

left['Order'] = left.groupby(['key1','key2']).cumcount()
right['Order'] = right.groupby(['key1','key2']).cumcount()

result = left.merge(right, how='left', 
                    on=['key1','key2','Order']).drop('Order',axis=1)
Sign up to request clarification or add additional context in comments.

1 Comment

This is the exactly what I want. groupby and cumcount is the key!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.