
I am merging two dataframes with an inner join. However, after the merge, every row appears duplicated, even though the column I merged on contains the same values in both dataframes.

Specifically, I have the following code.

merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')

Here are the two dataframes and the results.

df1

          email_address    name   surname
0  [email protected]    john     smith
1  [email protected]    john     smith
2       [email protected]   elvis   presley

df2

          email_address    street  city
0  [email protected]   street1    NY
1  [email protected]   street1    NY
2       [email protected]   street2    LA

merged_df

          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2  [email protected]    john     smith   street1    NY
3  [email protected]    john     smith   street1    NY
4       [email protected]   elvis   presley   street2    LA
5       [email protected]   elvis   presley   street2    LA

My question is, shouldn't it be like this?

This is what I would like merged_df to look like.

          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2       [email protected]   elvis   presley   street2    LA

Are there any ways I can achieve this?


3 Answers

61
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1 , list_2_nodups , on=['email_address'])


The duplicate rows are expected. Each john smith row in list_1 (your df1) matches each john smith row in list_2 (your df2), so two rows on each side produce four rows in the result. I had to drop the duplicates in one of the lists. I chose list_2.
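As a rough illustration of the point above, here is a minimal sketch. The email addresses are hypothetical stand-ins (the real values are not visible in the question), and the frames are rebuilt by hand rather than taken from the asker's data.

```python
import pandas as pd

# Hypothetical stand-in addresses; the real values are not shown in the question.
list_1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
list_2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# Each of the 2 john rows on the left pairs with each of the 2 john rows on the right.
merged = pd.merge(list_1, list_2, on=["email_address"])
john_rows = int((merged["email_address"] == "john@example.com").sum())

# Dropping fully identical rows on one side first leaves one match per left row.
list_2_nodups = list_2.drop_duplicates()
merged_nodups = pd.merge(list_1, list_2_nodups, on=["email_address"])

print(john_rows, len(merged_nodups))
```

With these stand-in frames, the plain merge contains four john rows, while the merge against list_2_nodups keeps one output row per row of list_1.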


14

DO NOT drop duplicates BEFORE the merge, but after!

The best solution is to do the merge and then drop the duplicates.

In your case:

merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)
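To see what this does in practice, here is a small sketch using hypothetical stand-in addresses (the real ones are not visible in the question). It uses the non-inplace form of drop_duplicates so that the merged frame and the de-duplicated frame can be compared side by side.

```python
import pandas as pd

# Hypothetical stand-in data shaped like the question's df1 and df2.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

merged_df = pd.merge(df1, df2, on=["email_address"], how="inner")

# Keep the first row for each email_address and renumber the index from 0.
deduped = merged_df.drop_duplicates(
    subset=["email_address"], keep="first", ignore_index=True
)

print(len(merged_df), len(deduped))
```

Note that with subset=['email_address'] and keep='first', exactly one row per email address survives, so repeated rows for the same address collapse into a single row.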

4 Comments

Why do we need to drop the duplicates after the merge, and why does the merge generate duplicates? Could you help me understand? Many thanks.
@user19562955 Merge behaves like a mathematical join on sets. When a key value has more than one match in the second set, every match goes into the final dataframe. In the example above, given that the key was "email_address", the question for each row is: does the email in the first line of df1 have a counterpart in any line of df2? If yes, all matching lines are added, and so on for the following lines. In practice this happens very often, and the best way to keep only distinct records is to handle the duplicates after the merge.
This solution created an empty data frame for me.
@coding_is_fun Print the data frame right after the merge and check whether it is already empty. The drop_duplicates method would not produce an empty data frame, so it is more likely that something is wrong with your merge operation.
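
The matching rule described in the comments can be checked numerically: for an inner join, each key contributes (occurrences on the left) times (occurrences on the right) rows. The sketch below assumes hypothetical stand-in addresses and only the key column, which is enough to count rows.

```python
import pandas as pd

# Hypothetical stand-in keys; only the merge column matters for counting rows.
df1 = pd.DataFrame(
    {"email_address": ["john@example.com", "john@example.com", "elvis@example.com"]}
)
df2 = pd.DataFrame(
    {"email_address": ["john@example.com", "john@example.com", "elvis@example.com"]}
)

left_counts = df1["email_address"].value_counts()
right_counts = df2["email_address"].value_counts()

# Multiplication aligns on the key; keys present on only one side become NaN
# and are dropped, which mirrors what an inner join does.
expected_rows = int((left_counts * right_counts).dropna().sum())

actual_rows = len(pd.merge(df1, df2, on=["email_address"], how="inner"))
print(expected_rows, actual_rows)
```

Comparing expected_rows against the length of the inputs is a quick way to tell, before merging, whether duplicated keys will multiply rows.
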
3

To make sure you don't have duplicates in your keys, you can use the validate parameter:

validate : str, optional

If specified, checks if merge is of specified type.

  • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
  • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
  • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
  • “many_to_many” or “m:m”: allowed, but does not result in checks.

In your case, you don't want any duplicate keys in the "right" dataframe df2, so you need to set validate to many_to_one.

df1.merge(df2, on=['email_address'], validate='many_to_one')

If you have duplicate keys in df2, the function will raise this error:

MergeError: Merge keys are not unique in right record; not a many-to-one merge
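
A minimal sketch of how the check behaves, assuming hypothetical stand-in addresses. The exception class is imported from pandas.errors; the exact wording of the message may differ between pandas versions, so the sketch only prints it.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical stand-in data; df2 repeats the john key on purpose.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "city": ["NY", "NY", "LA"],
})

try:
    df1.merge(df2, on=["email_address"], validate="many_to_one")
    check_failed = False
except MergeError as exc:
    # Duplicate keys on the right side violate the many_to_one contract.
    check_failed = True
    print(exc)

# With unique keys on the right (rows 0 and 2 here), the same call succeeds.
df2_unique = df2.iloc[[0, 2]]
checked = df1.merge(df2_unique, on=["email_address"], validate="many_to_one")
print(len(checked))
```

Because the check raises instead of silently multiplying rows, it can be left in place as a guard in a data pipeline.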

To drop duplicate keys in df2 and do a merge you can use:

keys = ['email_address']
df1.merge(df2.drop_duplicates(subset=keys), on=keys)

Make sure you set the subset parameter in drop_duplicates to the key columns you are using to merge. If you don't specify a subset, drop_duplicates compares all columns, so rows that share a key but differ in any other column will not be dropped.
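
The difference is easiest to see when the non-key columns disagree. In this sketch the addresses are hypothetical stand-ins, and the second john row is deliberately given a different street ("street9", an invented value) so the two rows are no longer identical.

```python
import pandas as pd

keys = ["email_address"]

# Hypothetical stand-in data.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
})
# Same key twice, but the street differs, so the rows are not full duplicates.
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street9", "street2"],
})

# Without subset: all columns are compared, so both john rows survive.
all_columns = df2.drop_duplicates()

# With subset: only the key is compared, so one john row survives (the first).
keys_only = df2.drop_duplicates(subset=keys)

merged_all_columns = df1.merge(all_columns, on=keys)
merged_keys_only = df1.merge(keys_only, on=keys)
print(len(merged_all_columns), len(merged_keys_only))
```

Keep in mind that subset with the default keep='first' silently discards the other street value, so which row is kept depends on the order of df2.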
