
I have a DataFrame that I want to cross-merge with itself and then drop the duplicate rows, i.e. rows where key_x and key_y have the same value. In the output below those are rows 0, 3, 5, 10, 12, and 15.

My DataFrame:

import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'], 'value1': [1, 2, 3, 4]})

I have tried the code below, which does the merge, but how do I drop the duplicate rows where the value in key_x matches the value in key_y?

merged_df = df1.merge(df1, how='cross')
print(merged_df)
  key_x  value1_x key_y  value1_y
0      A         1     A         1 # Duplicate A
1      A         1     B         2
2      A         1     C         3
3      A         1     A         4 # Duplicate A
4      B         2     A         1
5      B         2     B         2 # Duplicate B
6      B         2     C         3
7      B         2     A         4
8      C         3     A         1
9      C         3     B         2
10     C         3     C         3 # Duplicate C
11     C         3     A         4
12     A         4     A         1 # Duplicate A
13     A         4     B         2
14     A         4     C         3
15     A         4     A         4 # Duplicate A

I would like my result to be something like this:

   key_x  value1_x key_y  value1_y
1      A         1     B         2
2      A         1     C         3
4      B         2     A         1
6      B         2     C         3
7      B         2     A         4
8      C         3     A         1
9      C         3     B         2
11     C         3     A         4
13     A         4     B         2
14     A         4     C         3

2 Answers


You can filter out rows where the keys are the same — that removes all duplicates like (A, A) or (C, C).

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'], 'value1': [1, 2, 3, 4]})

merged_df = df1.merge(df1, how='cross')

# Keep only rows where key_x != key_y
filtered_df = merged_df[merged_df['key_x'] != merged_df['key_y']]

print(filtered_df)

The output is exactly the one you want:

   key_x  value1_x key_y  value1_y
1      A         1     B         2
2      A         1     C         3
4      B         2     A         1
6      B         2     C         3
7      B         2     A         4
8      C         3     A         1
9      C         3     B         2
11     C         3     A         4
13     A         4     B         2
14     A         4     C         3

If you instead want to drop only rows where both the key and value are identical, use:

filtered_df = merged_df[
    ~((merged_df['key_x'] == merged_df['key_y']) &
      (merged_df['value1_x'] == merged_df['value1_y']))
]

Which approach to use depends on how strictly you define "duplicate."
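To see the difference concretely, here is a small check (using the same cross merge as above) comparing how many rows each definition keeps — the variable names `loose` and `strict` are my own labels:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'], 'value1': [1, 2, 3, 4]})
merged_df = df1.merge(df1, how='cross')  # 16 rows

# Loose definition: drop whenever the keys match
# (removes rows 0, 3, 5, 10, 12, 15 -> 10 rows left).
loose = merged_df[merged_df['key_x'] != merged_df['key_y']]

# Strict definition: drop only when key AND value both match
# (removes rows 0, 5, 10, 15 -> 12 rows left).
strict = merged_df[
    ~((merged_df['key_x'] == merged_df['key_y']) &
      (merged_df['value1_x'] == merged_df['value1_y']))
]

print(len(loose), len(strict))  # 10 12
```

The strict version keeps rows 3 and 12, where the two different A rows (values 1 and 4) are paired with each other.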



In this case I value readability, so I would use query:

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'], 'value1': [1, 2, 3, 4]})
merged_df = df1.merge(df1, how='cross')

df_merged_filtered = merged_df.query('key_x != key_y')

print(df_merged_filtered)

Output:

   key_x  value1_x key_y  value1_y
1      A         1     B         2
2      A         1     C         3
4      B         2     A         1
6      B         2     C         3
7      B         2     A         4
8      C         3     A         1
9      C         3     B         2
11     C         3     A         4
13     A         4     B         2
14     A         4     C         3

And much like @charly_0x13 does in their solution, you can filter on both columns. Note that the negation of "key and value both match" is "key differs or value differs", so the equivalent query uses or:

df_merged_filtered = merged_df.query('key_x != key_y or value1_x != value1_y')
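If by "duplicate" you actually mean a row paired with itself (so the two A rows may still pair with each other), one option is to carry the original row number through the merge and filter on that instead — a sketch, with `tmp` and `pairs` as illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'], 'value1': [1, 2, 3, 4]})

# reset_index() turns the row number into an 'index' column, so after the
# cross merge each side keeps its source row in index_x / index_y.
tmp = df1.reset_index()
pairs = tmp.merge(tmp, how='cross').query('index_x != index_y')
print(pairs)  # 12 rows: every pair of distinct source rows
```

Unlike the key-based filter, this keeps all pairings of (A, 1) with (A, 4), since they come from different source rows.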

