Merge two DataFrames based on multiple keys in pandas

Question

For example, I have two tables (DataFrames):

a:

A  B  value1
1  1      23
1  2      34
2  1    2342
2  2     333

and b:

A  B  value2
1  1    0.10
1  2    0.20
2  1    0.13
2  2    0.33

The desired result is:

A  B  value1  value2
1  1      23    0.10
1  2      34    0.20
2  1    2342    0.13
2  2     333    0.33

Does pandas have a function to merge (or join) two tables based on multiple keys?

Alex Riley · Accepted Answer · 2015-08-28 18:25:33Z

To merge by multiple keys, you just need to pass the keys in a list to pd.merge:

>>> pd.merge(a, b, on=['A', 'B'])
   A  B  value1  value2
0  1  1      23    0.10
1  1  2      34    0.20
2  2  1    2342    0.13
3  2  2     333    0.33

In fact, the default for pd.merge is to use the intersection of the two DataFrames' column labels, so pd.merge(a, b) would work equally well in this case.
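As a quick check of the two calls above, here is a minimal sketch that rebuilds the tables from the question (the DataFrame constructors are an assumption, since the question only shows the printed tables) and compares the explicit and default forms:

```python
import pandas as pd

# Assumed reconstruction of the two tables shown in the question.
a = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value1": [23, 34, 2342, 333]})
b = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value2": [0.10, 0.20, 0.13, 0.33]})

# Explicit list of keys.
explicit = pd.merge(a, b, on=["A", "B"])

# on=None: pandas joins on the columns the two frames share (here A and B).
implicit = pd.merge(a, b)

print(explicit)
print(explicit.equals(implicit))
```

Relying on the default is convenient, but it silently changes the join keys if either frame later gains another shared column, so spelling out on= is usually the safer choice.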

Miguel Rueda · Answer · 2021-08-12 19:38:00Z

According to the most recent pandas documentation, the on parameter accepts either a single label or a list of labels, and each one must be found in both DataFrames. Here is a minimal working example:

>>> import pandas as pd
>>> a = pd.DataFrame({'A': ['0', '0', '1', '1'], 'B': ['0', '1', '0', '1'], 'v': [True, False, False, True]})
>>> b = pd.DataFrame({'A': ['0', '0', '1', '1'], 'B': ['0', '1', '0', '1'], 'v': [False, True, True, True]})
>>> # Both frames have a non-key column 'v', so suffixes tell the two copies apart.
>>> result = pd.merge(a, b, on=['A', 'B'], how='inner', suffixes=['_and', '_or'])
>>> result
   A  B  v_and   v_or
0  0  0   True  False
1  0  1  False   True
2  1  0  False   True
3  1  1   True   True

on : label or list
    Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

See the latest pd.merge documentation for further details.
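To illustrate the two points in the quoted description (a single label is also accepted, and every key must exist in both frames), here is a small sketch. The two-row frames are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical two-row frames, only to illustrate the `on` rules.
a = pd.DataFrame({"A": [1, 2], "B": [1, 2], "value1": [23, 34]})
b = pd.DataFrame({"A": [1, 2], "B": [1, 2], "value2": [0.10, 0.20]})

# A single label works too; the shared non-key column B gets the
# default suffixes _x and _y.
single = pd.merge(a, b, on="A")
print(list(single.columns))

# 'value1' exists only in `a`, so it cannot be used as a join key.
try:
    pd.merge(a, b, on=["A", "value1"])
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```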

cottontail · Answer · 2024-02-19 04:12:25Z

You can also use the left_on=, right_on=, left_index= or right_index= parameters. In that case the keys are matched in the order they are passed: the first key in left_on is matched with the first key in right_on, and so on.

So using the example in the OP, the following two produce the same output:

a.merge(b, left_on=['A', 'B'], right_on=['A', 'B'])
a.merge(b, on=['A', 'B'])

However, a.merge(b, left_on=['A', 'B'], right_on=['B', 'A']) will produce a very different output because a['A'] is matched to b['B'] and a['B'] is matched to b['A'].
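A short sketch of that difference, assuming the tables from the question are rebuilt as below (the constructors are not part of the original post):

```python
import pandas as pd

# Assumed reconstruction of the tables from the question.
a = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value1": [23, 34, 2342, 333]})
b = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 2],
                  "value2": [0.10, 0.20, 0.13, 0.33]})

# Keys paired as A<->A and B<->B: same pairing as on=['A', 'B'].
same = a.merge(b, left_on=["A", "B"], right_on=["A", "B"])

# Keys paired as a.A<->b.B and a.B<->b.A.
swapped = a.merge(b, left_on=["A", "B"], right_on=["B", "A"])

print(same)
print(swapped)

# Map each value1 to the value2 it was paired with in the swapped merge.
pairs = swapped.set_index("value1")["value2"].to_dict()
print(pairs)
```

In the swapped merge the row of a with A=1, B=2 is paired with the row of b that has B=1, A=2, so value1=34 ends up next to value2=0.13 instead of 0.20. Because the paired key columns have different names, both copies are kept and the overlapping names receive the default _x and _y suffixes.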

This is especially useful if the keys to match are named differently. For example:

a.merge(b, left_on=['A1', 'A2'], right_on=['B1', 'B2'])

This is equivalent to the SQL query:

SELECT * FROM a INNER JOIN b ON a.A1=b.B1 AND a.A2=b.B2
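Here is a runnable sketch of the differently named case. The frames and their contents are hypothetical, chosen so that one row on each side has no partner:

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
a = pd.DataFrame({"A1": [1, 1, 2], "A2": [1, 2, 1],
                  "value1": [23, 34, 2342]})
b = pd.DataFrame({"B1": [1, 1, 3], "B2": [1, 2, 3],
                  "value2": [0.10, 0.20, 0.99]})

# Pairs A1 with B1 and A2 with B2; the default how='inner' keeps only
# rows whose key pair appears in both frames.
merged = a.merge(b, left_on=["A1", "A2"], right_on=["B1", "B2"])
print(merged)
```

As with SELECT *, both sets of key columns appear in the result; drop B1 and B2 afterwards if the duplicates are not wanted.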

A useful note: because reindexing occurs under the hood, the rows of the merged output follow the order of the left keys (apparently that's not the case in cudf 24.02, but that's another matter).
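The row order can be checked with a small sketch. It assumes the default how='inner', for which the pandas documentation describes preserving the order of the left keys; the shuffled left frame is made up for illustration:

```python
import pandas as pd

# Hypothetical left frame with keys deliberately out of order.
a = pd.DataFrame({"A": [2, 1, 1], "B": [1, 2, 1],
                  "value1": [2342, 34, 23]})
b = pd.DataFrame({"A": [1, 1, 2], "B": [1, 2, 1],
                  "value2": [0.10, 0.20, 0.13]})

# Default sort=False: rows come out in the order of the keys in `a`.
left_order = a.merge(b, on=["A", "B"])

# sort=True: rows are ordered lexicographically by the join keys instead.
key_order = a.merge(b, on=["A", "B"], sort=True)

print(left_order)
print(key_order)
```

If a specific order matters downstream, passing sort=True or calling sort_values explicitly is more robust than relying on the default.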
