Duplicated rows when merging dataframes in Python

Question

I am merging two dataframes with an inner join. After the merge, however, the rows appear duplicated, even though the column I merged on contains the same values in both dataframes.

Specifically, I have the following code.

merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')

Here are the two dataframes and the results.

df1

          email_address    name   surname
0  [email protected]    john     smith
1  [email protected]    john     smith
2       [email protected]   elvis   presley

df2

          email_address    street  city
0  [email protected]   street1    NY
1  [email protected]   street1    NY
2       [email protected]   street2    LA

merged_df

          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2  [email protected]    john     smith   street1    NY
3  [email protected]    john     smith   street1    NY
4       [email protected]   elvis   presley   street2    LA
5       [email protected]   elvis   presley   street2    LA

Shouldn't the result look like this instead?

This is how I would like merged_df to look.

          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2       [email protected]   elvis   presley   street2    LA

Are there any ways I can achieve this?

piRSquared · Accepted Answer · 2016-08-18 13:50:11Z

61

# Remove fully duplicated rows from the right-hand dataframe before merging
df2_nodups = df2.drop_duplicates()
pd.merge(df1, df2_nodups, on=['email_address'])

The duplicate rows are expected. Each john smith in df1 matches each john smith in df2. I had to drop the duplicates in one of the dataframes. I chose df2.
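A minimal sketch of this approach on data shaped like the question's tables. The email values are placeholders (the originals are not readable in the post), and the frame names follow the question's df1 and df2:

```python
import pandas as pd

# Placeholder data shaped like the question's tables (emails are made up).
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# Plain inner join: each john row in df1 pairs with each john row in df2.
naive = pd.merge(df1, df2, on=["email_address"], how="inner")

# De-duplicate the right-hand frame first, then merge.
df2_nodups = df2.drop_duplicates()
fixed = pd.merge(df1, df2_nodups, on=["email_address"], how="inner")

print(len(naive), len(fixed))
```

On this sample, the plain merge returns more rows than df1 has, while the merge against the de-duplicated frame returns one row per row of df1.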

answered Aug 18, 2016 at 13:40 by piRSquared · edited Aug 18, 2016 at 13:50


Rafael Amaral · Answer · 2022-05-03 21:21 (edited by wjandrea, 2022-06-27 19:27:59Z)

14

DO NOT drop duplicates BEFORE the merge, but after!

The best solution is to do the merge and then drop the duplicates.

In your case:

merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)
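As a rough sketch of what these two lines do, here is the same idea on data shaped like the question's tables. The emails are placeholders, and this variant returns a new frame instead of using inplace=True:

```python
import pandas as pd

# Placeholder data shaped like the question's tables (emails are made up).
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

merged_df = pd.merge(df1, df2, on=["email_address"], how="inner")

# Keep the first merged row for each email address and renumber the index.
deduped = merged_df.drop_duplicates(
    subset=["email_address"], keep="first", ignore_index=True
)
print(deduped)
```

Note that this keeps exactly one row per email_address, so keys that are repeated in df1 are collapsed as well.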

answered May 3, 2022 at 21:21 by Rafael Amaral · edited Jun 27, 2022 at 19:27 by wjandrea

4 Comments

user19562955 Over a year ago

Why do we need to drop the duplicates after the merge? And why does the merge generate duplicates? Could you help me understand? Many thanks.

Rafael Amaral Over a year ago

@user19562955 Merge works like the corresponding mathematical operation on sets. When a key value has more than one match in the second dataframe, every match goes into the final dataframe. In the example above, with "email_address" as the key, the question is: does the email in the first row of df1 have a match in any row of df2? If yes, all matching rows are added, and so on for the remaining rows. In practice this happens very often, and the best way to keep only distinct records is to handle the duplicates after the merge.
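To illustrate the matching described in this comment: in an inner join, each key contributes (rows with that key on the left) times (rows with that key on the right) output rows. A small sketch with made-up keys and column names:

```python
import pandas as pd

# Hypothetical frames: key "a" appears 2x on the left and 3x on the right.
left = pd.DataFrame({"key": ["a", "a", "b", "c"], "l": [1, 2, 3, 4]})
right = pd.DataFrame({"key": ["a", "a", "a", "b", "d"], "r": [10, 20, 30, 40, 50]})

merged = left.merge(right, on="key", how="inner")

# Per-key row counts in each input; keys present on only one side drop out.
left_counts = left["key"].value_counts()
right_counts = right["key"].value_counts()
expected = (left_counts * right_counts).dropna().astype(int)

# Per-key row counts actually produced by the merge.
actual = merged["key"].value_counts()

print(expected.sort_index().to_dict())
print(actual.sort_index().to_dict())
```

The two dictionaries agree, which is why a key duplicated on both sides multiplies the number of rows in the result.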

coding_is_fun Over a year ago

this solution created an empty data frame for me

Rafael Amaral Over a year ago

@coding_is_fun Use the print function after the merge and check whether the data frame is already empty. The drop_duplicates method would not produce an empty data frame. It is more likely that something is wrong with your merge operation.

Mykola Zotko · Answer · 2023-09-11 06:09:42Z

To make sure you don't have duplicates in your keys, you can use the validate parameter:

validate : str, optional

If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

“one_to_many” or “1:m”: check if merge keys are unique in left dataset.

“many_to_one” or “m:1”: check if merge keys are unique in right dataset.

“many_to_many” or “m:m”: allowed, but does not result in checks.

In your case, you don't want any duplicate keys in the "right" dataframe df2, so you need to set validate to many_to_one.

df1.merge(df2, on=['email_address'], validate='many_to_one')

If you have duplicate keys in df2, the function will raise this error:

MergeError: Merge keys are not unique in right dataset; not a many-to-one merge
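A small sketch of how the check could be used defensively, assuming placeholder emails and a hypothetical flag name duplicates_detected. MergeError is importable from pandas.errors and is a subclass of ValueError:

```python
import pandas as pd
from pandas.errors import MergeError

# Placeholder data: df2 contains the same key twice.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "elvis@example.com"],
    "name": ["john", "elvis"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "city": ["NY", "NY", "LA"],
})

try:
    # Fails because the keys on the right-hand side are not unique.
    df1.merge(df2, on=["email_address"], validate="many_to_one")
    duplicates_detected = False
except MergeError as exc:
    duplicates_detected = True
    print(f"merge rejected: {exc}")
```

Failing early like this surfaces duplicate keys at the merge itself rather than as unexpected extra rows later on.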

To drop duplicate keys in df2 and do a merge you can use:

keys = ['email_address']
df1.merge(df2.drop_duplicates(subset=keys), on=keys)

Make sure you set the subset parameter in drop_duplicates to the key columns you are merging on. If you don't specify a subset, drop_duplicates compares all columns, so rows that share a key but differ in any other column will not be dropped.
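To make the difference concrete, here is a sketch with a hypothetical df2 in which the same email appears with two different streets (all values are made up):

```python
import pandas as pd

# Hypothetical right-hand frame: same email twice, but with different streets.
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street9", "street2"],
    "city": ["NY", "NY", "LA"],
})
keys = ["email_address"]

# Without subset: the two john rows differ in "street", so both are kept.
all_columns = df2.drop_duplicates()

# With subset: only the key columns are compared, so the first john row wins.
by_key = df2.drop_duplicates(subset=keys)

print(len(all_columns), len(by_key))
print(by_key["email_address"].is_unique)
```

Only the subset version leaves unique keys in df2. Which of the conflicting rows is kept depends on the keep parameter, which defaults to the first occurrence.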
