
For example, I have two tables (DataFrames):

a:

A  B  value1
1  1      23
1  2      34
2  1    2342
2  2     333

and b:

A  B  value2
1  1    0.10
1  2    0.20
2  1    0.13
2  2    0.33

The desired result is:

A  B  value1  value2
1  1      23    0.10
1  2      34    0.20
2  1    2342    0.13
2  2     333    0.33

Does pandas have a function that supports merging (or joining) two tables on multiple keys?


3 Answers

112

To merge by multiple keys, you just need to pass the keys in a list to pd.merge:

>>> pd.merge(a, b, on=['A', 'B'])
   A  B  value1  value2
0  1  1      23    0.10
1  1  2      34    0.20
2  2  1    2342    0.13
3  2  2     333    0.33

In fact, the default for pd.merge is to use the intersection of the two DataFrames' column labels, so pd.merge(a, b) would work equally well in this case.
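As a quick check of that claim, here is a minimal sketch that rebuilds the two tables from the question and compares the explicit and the implicit call. The variable names `explicit` and `implicit` are only illustrative.

```python
import pandas as pd

# The two example tables from the question.
a = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value1": [23, 34, 2342, 333]})
b = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value2": [0.10, 0.20, 0.13, 0.33]})

explicit = pd.merge(a, b, on=["A", "B"])  # keys named explicitly
implicit = pd.merge(a, b)                 # keys inferred from the shared columns A and B

print(explicit.equals(implicit))  # True
```

Relying on the default is convenient, but naming the keys explicitly protects against an unrelated column that happens to share a name in both frames.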


11

According to the most recent pandas documentation, the on parameter accepts either a single label or a list of labels, and each of them must be found in both DataFrames. Here is an MWE for its use:

import pandas as pd

a = pd.DataFrame({'A': ['0', '0', '1', '1'], 'B': ['0', '1', '0', '1'], 'v': [True, False, False, True]})

b = pd.DataFrame({'A': ['0', '0', '1', '1'], 'B': ['0', '1', '0', '1'], 'v': [False, True, True, True]})

# Both frames have a column 'v', so the suffixes tell the two apart in the result.
result = pd.merge(a, b, on=['A', 'B'], how='inner', suffixes=['_and', '_or'])
>>> result
   A  B  v_and   v_or
0  0  0   True  False
1  0  1  False   True
2  1  0  False   True
3  1  1   True   True

on : label or list
    Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes, then this defaults to the intersection of the columns in both DataFrames.

See the latest pd.merge documentation for further details.
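The documentation excerpt says that on accepts index level names as well as column names. A minimal sketch of that case, assuming the tables from the question with A and B moved into a MultiIndex on both sides (the names `a_idx`, `b_idx` and `merged` are illustrative):

```python
import pandas as pd

# The question's tables, with A and B moved into a MultiIndex.
a_idx = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                      "value1": [23, 34, 2342, 333]}).set_index(["A", "B"])
b_idx = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                      "value2": [0.10, 0.20, 0.13, 0.33]}).set_index(["A", "B"])

# "A" and "B" are index level names here, not columns.
merged = pd.merge(a_idx, b_idx, on=["A", "B"])

# Flatten the index to view the keys next to the values.
print(merged.reset_index())
```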


2

You can also use the left_on=, right_on=, left_index= or right_index= parameters. In that case the keys are matched in the order they are passed: the first key in left_on is matched with the first key in right_on, and so on.

So using the example in the OP, the following two produce the same output:

a.merge(b, left_on=['A', 'B'], right_on=['A', 'B'])
a.merge(b, on=['A', 'B'])

However, a.merge(b, left_on=['A', 'B'], right_on=['B', 'A']) will produce a very different output because a['A'] is matched to b['B'] and a['B'] is matched to b['A'].
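To make the difference concrete, here is a small sketch using the tables from the question. The names `same` and `swapped` are illustrative only.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value1": [23, 34, 2342, 333]})
b = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value2": [0.10, 0.20, 0.13, 0.33]})

# Keys paired as A-A and B-B: same rows as on=['A', 'B'].
same = a.merge(b, left_on=["A", "B"], right_on=["A", "B"])

# Keys paired as a.A-b.B and a.B-b.A: the off-diagonal rows change partners,
# so value1 == 34 (A=1, B=2 in a) now sits next to value2 == 0.13 (A=2, B=1 in b).
swapped = a.merge(b, left_on=["A", "B"], right_on=["B", "A"])

print(same[["value1", "value2"]])
print(swapped[["value1", "value2"]])
```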

This is especially useful if the keys to match are named differently. For example:

a.merge(b, left_on=['A1', 'A2'], right_on=['B1', 'B2'])

This is equivalent to the SQL query:

SELECT * FROM a INNER JOIN b ON a.A1=b.B1 AND a.A2=b.B2
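A runnable sketch of that case follows. The data and the value column names (left_val, right_val) are made up for illustration; only the key names A1, A2, B1 and B2 come from the call above.

```python
import pandas as pd

# Hypothetical tables whose key columns are named differently.
a = pd.DataFrame({"A1": [1, 1, 2], "A2": ["x", "y", "x"],
                  "left_val": [10, 20, 30]})
b = pd.DataFrame({"B1": [1, 2, 2], "B2": ["x", "x", "y"],
                  "right_val": [0.1, 0.2, 0.3]})

# Inner join on a.A1 = b.B1 AND a.A2 = b.B2.
merged = a.merge(b, left_on=["A1", "A2"], right_on=["B1", "B2"])
print(merged)

# Both sets of key columns are kept; drop one set if it is redundant.
deduplicated = merged.drop(columns=["B1", "B2"])
```

Only the key pairs (1, 'x') and (2, 'x') occur in both tables, so the result has two rows.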

A useful note: because reindexing occurs under the hood (source), the rows of the merged output follow the order of the left keys, which is what the pandas documentation describes for the default how='inner' (apparently that's not the case in cudf 24.02, but that's another matter).
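The left_index= and right_index= parameters mentioned at the start of this answer can be sketched the same way. This assumes the left table from the question has A and B in a MultiIndex while the right table still has them as columns; `a_idx` and `merged` are illustrative names.

```python
import pandas as pd

# Left table keyed by a MultiIndex (A, B); right table keeps A and B as columns.
a_idx = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                      "value1": [23, 34, 2342, 333]}).set_index(["A", "B"])
b = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value2": [0.10, 0.20, 0.13, 0.33]})

# The index levels of a_idx are matched, in order, with columns A and B of b.
merged = a_idx.merge(b, left_index=True, right_on=["A", "B"])
print(merged[["value1", "value2"]])
```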
