Related to: How to find row with same value in 2 columns between 2 dataframes but different values in other columns pandas

I have two DataFrames: df1 and df2.

I would like to find all the rows in these combined DataFrames that have identical values in 'columnA' (object) and 'columnB' (int). These rows will have differing values in other columns I don't care about. The shape of these DataFrames also differs.

I've tried something like:

concat = pd.concat([df1, df2])
overlap = concat[concat.duplicated(subset=['columnA','columnB'], keep=False)]

The output doesn't look right to me, though it may be. Am I missing anything?

Edit:

Say I wanted all the rows with the same value in columnA but different values in columnB - would this work?

df3 = (concat[concat.duplicated(subset=['columnA'], keep=False)]
           .drop_duplicates(subset=['columnB']))
  • Have you tried pd.merge()? Commented Sep 16, 2019 at 22:11
  • Regarding your edit: are you trying to merge two df's, each with their own respective columnA and columnB? Furthermore, if columnA_df == columnA_df2 but columnB_df == columnB_df2, drop row? Commented Sep 16, 2019 at 22:27
  • Apologies - looking for separate outputs. All rows with identical values in columnA and columnB. Separately, in separate output, all rows with identical values in columnA but different values in columnB. Commented Sep 16, 2019 at 22:29

1 Answer

You can use pd.merge:

import pandas as pd

df1 = pd.DataFrame(data=[('A','B','C'),('E','F','G'),('A','B','F')], columns=['columnA','columnB','columnC'])
df2 = pd.DataFrame(data=[('X','Y','G'),('A','B','Y'),('A','C','F')], columns=['columnA','columnB','columnC'])

# convert columnB to string so it has the same dtype in both frames
df2['columnB'] = df2['columnB'].astype(str)

print(df1)
  columnA columnB columnC
0       A       B       C
1       E       F       G
2       A       B       F

print(df2)
  columnA columnB columnC
0       X       Y       G
1       A       B       Y
2       A       C       F

And then after applying pd.merge:

df_m = pd.merge(df1,df2,how='inner',on='columnA')

----
df_m
  columnA columnB_x columnC_x columnB_y columnC_y
0       A         B         C         B         Y
1       A         B         C         C         F
2       A         B         F         B         Y
3       A         B         F         C         F
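
For the first part of the question (rows with identical values in both columnA and columnB), one option is to merge on both keys instead of columnA alone. The following is only a sketch: the data is made up, with columnB stored as int in one frame and as strings in the other to mimic the dtype mismatch behind the ValueError mentioned in the comments, and the `_df1`/`_df2` suffixes are my own choice.

```python
import pandas as pd

# Hypothetical data mirroring the question: columnA is object, columnB is int.
df1 = pd.DataFrame({'columnA': ['A', 'E', 'A'],
                    'columnB': [1, 2, 1],
                    'columnC': ['C', 'G', 'F']})
# Assumption: df2 holds columnB as strings, to reproduce the dtype mismatch.
df2 = pd.DataFrame({'columnA': ['X', 'A', 'A'],
                    'columnB': ['9', '1', '3'],
                    'columnC': ['G', 'Y', 'F']})

# Align the key dtype first; merging an int64 key with an object key raises ValueError.
df2['columnB'] = df2['columnB'].astype('int64')

# An inner merge on both keys keeps only (columnA, columnB) pairs present in both frames.
both = pd.merge(df1, df2, how='inner', on=['columnA', 'columnB'],
                suffixes=('_df1', '_df2'))
print(both)
```

Unlike the concat/duplicated approach, this only reports pairs that occur in both frames, not duplicates that live inside a single frame.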

Regarding your edit, try this:

df_final = df_m[df_m['columnB_x'] != df_m['columnB_y']]

------
print(df_final)
  columnA columnB_x columnC_x columnB_y columnC_y
1       A         B         C         C         F
3       A         B         F         C         F
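
If you would rather stay with the concat approach from your edit, note that drop_duplicates(subset=['columnB']) de-duplicates on columnB across the whole frame, regardless of columnA, so it does not test whether a given columnA value has more than one columnB value. A possible alternative, sketched here on the same sample data (the names `n_distinct` and `differing` are only illustrative), is to count the distinct columnB values per columnA:

```python
import pandas as pd

df1 = pd.DataFrame(data=[('A', 'B', 'C'), ('E', 'F', 'G'), ('A', 'B', 'F')],
                   columns=['columnA', 'columnB', 'columnC'])
df2 = pd.DataFrame(data=[('X', 'Y', 'G'), ('A', 'B', 'Y'), ('A', 'C', 'F')],
                   columns=['columnA', 'columnB', 'columnC'])

# Stack both frames; ignore_index avoids repeated index labels.
concat = pd.concat([df1, df2], ignore_index=True)

# For every row, the number of distinct columnB values seen for its columnA.
n_distinct = concat.groupby('columnA')['columnB'].transform('nunique')

# Keep rows whose columnA is associated with more than one columnB value.
differing = concat[n_distinct > 1]
print(differing)
```

Be aware that this also flags differences that occur within a single frame, whereas the merge above only compares rows of df1 against rows of df2.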

3 Comments

Will update original post with datatypes - ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
Is columnA a string object?
OK, final update complete. Please mark this as the answer if it is sufficient.
