Identifying rows shared between two Pandas DataFrames based on two columns

Question

Related to: How to find row with same value in 2 columns between 2 dataframes but different values in other columns pandas

I have two DataFrames: df1 and df2.

I would like to find all the rows in these combined DataFrames that have identical values in 'columnA' (object) and 'columnB' (int). These rows will have differing values in other columns I don't care about. The shape of these DataFrames also differs.

I've tried something like:

concat = pd.concat([df1, df2])
overlap = concat[concat.duplicated(subset=['columnA','columnB'], keep=False)]

But the output doesn't look right (though maybe it is). I just want to check: am I missing anything?

Edit:

Say I wanted all the rows with the same value in columnA but different values in columnB - would this work?

df3 = (concat[concat.duplicated(subset=['columnA'], keep=False)]
           .drop_duplicates(subset=['columnB']))

Regarding your edit: are you trying to merge two DataFrames, each with its own columnA and columnB? And if columnA_df == columnA_df2 but columnB_df == columnB_df2, should the row be dropped?
– ParalysisByAnalysis, Commented Sep 16, 2019 at 22:27
Apologies, I'm looking for separate outputs. First, all rows with identical values in columnA and columnB. Second, in a separate output, all rows with identical values in columnA but different values in columnB.
– Cactus Philosopher, Commented Sep 16, 2019 at 22:29

ParalysisByAnalysis · Accepted Answer · 2019-09-16 22:42:57Z

You can use pd.merge. Start with two small example DataFrames:

import pandas as pd

df1 = pd.DataFrame(data=[('A','B','C'),('E','F','G'),('A','B','F')], columns=['columnA','columnB','columnC'])
df2 = pd.DataFrame(data=[('X','Y','G'),('A','B','Y'),('A','C','F')], columns=['columnA','columnB','columnC'])

df2['columnB'] = df2['columnB'].astype(str)  # make the dtype match df1's columnB (already str in this toy data)

print(df1)
  columnA columnB columnC
0       A       B       C
1       E       F       G
2       A       B       F

print(df2)
  columnA columnB columnC
0       X       Y       G
1       A       B       Y
2       A       C       F

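Then apply pd.merge with an inner join on columnA. Column names that appear in both frames get the suffixes _x (from df1) and _y (from df2):

df_m = pd.merge(df1, df2, how='inner', on='columnA')

print(df_m)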
  columnA columnB_x columnC_x columnB_y columnC_y
0       A         B         C         B         Y
1       A         B         C         C         F
2       A         B         F         B         Y
3       A         B         F         C         F
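For the first part of the question (rows whose columnA and columnB both match), the same idea can be extended by joining on both columns. The following is a minimal sketch rather than part of the answer as posted; it reuses the toy frames above, and the names keys, both_match, shared_keys, df1_shared, and df2_shared are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A', 'B', 'C'), ('E', 'F', 'G'), ('A', 'B', 'F')],
    columns=['columnA', 'columnB', 'columnC'],
)
df2 = pd.DataFrame(
    [('X', 'Y', 'G'), ('A', 'B', 'Y'), ('A', 'C', 'F')],
    columns=['columnA', 'columnB', 'columnC'],
)

keys = ['columnA', 'columnB']  # hypothetical helper name

# Inner join on both key columns: one output row per matching
# (df1 row, df2 row) pair; columnC is suffixed with _x and _y.
both_match = pd.merge(df1, df2, how='inner', on=keys)

# Key pairs that occur in both frames, without repeats.
shared_keys = both_match[keys].drop_duplicates()

# Rows of each original frame whose key pair also occurs in the other frame.
df1_shared = df1.merge(shared_keys, on=keys, how='inner')
df2_shared = df2.merge(shared_keys, on=keys, how='inner')

print(both_match)
print(df1_shared)
print(df2_shared)
```

Unlike pd.concat followed by duplicated(keep=False), this only reports key pairs present in both frames; a pair repeated inside df1 alone would not be picked up.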

Regarding your edit (same columnA, different columnB), filter the merged frame to the pairs whose columnB values differ:

df_final = df_m[df_m['columnB_x'] != df_m['columnB_y']]

print(df_final)
  columnA columnB_x columnC_x columnB_y columnC_y
1       A         B         C         C         F
3       A         B         F         C         F
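The filtered result holds one row per (df1 row, df2 row) pair. If the matching rows of each original frame are wanted instead, one possible approach is to carry the original row labels through the merge. This is a sketch under that assumption, not something stated in the answer; the suffixes _1 and _2 and the names pairs, diff, df1_rows, and df2_rows are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A', 'B', 'C'), ('E', 'F', 'G'), ('A', 'B', 'F')],
    columns=['columnA', 'columnB', 'columnC'],
)
df2 = pd.DataFrame(
    [('X', 'Y', 'G'), ('A', 'B', 'Y'), ('A', 'C', 'F')],
    columns=['columnA', 'columnB', 'columnC'],
)

# reset_index() turns each frame's row labels into an 'index' column,
# which the merge then suffixes to 'index_1' and 'index_2'.
pairs = df1.reset_index().merge(
    df2.reset_index(), on='columnA', suffixes=('_1', '_2')
)

# Keep pairs that share columnA but disagree on columnB.
diff = pairs[pairs['columnB_1'] != pairs['columnB_2']]

# Look the surviving labels up in the original frames.
df1_rows = df1.loc[diff['index_1'].unique()]
df2_rows = df2.loc[diff['index_2'].unique()]

print(df1_rows)
print(df2_rows)
```

With the toy data this selects rows 0 and 2 of df1 and row 2 of df2.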

edited Sep 16, 2019 at 22:42

answered Sep 16, 2019 at 22:15

ParalysisByAnalysis

3 Comments

Cactus Philosopher Over a year ago

Will update original post with datatypes - ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

ParalysisByAnalysis Over a year ago

Is columnA a string object?

ParalysisByAnalysis Over a year ago

OK, final update complete. Please mark this as the accepted answer if it is sufficient.
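The ValueError quoted in the comments appears when a merge key holds integers in one frame and strings in the other. A common way around it is to bring the key to a single dtype in both frames before merging. The sketch below assumes, as the question states, that columnB is an int in one frame, and further assumes it is stored as text in the other; the frames left and right and their columns x and y are invented for illustration.

```python
import pandas as pd

# Hypothetical frames: columnB is int64 on the left, text on the right.
left = pd.DataFrame({'columnA': ['a', 'b'], 'columnB': [1, 2], 'x': [10, 20]})
right = pd.DataFrame({'columnA': ['a', 'b'], 'columnB': ['1', '3'], 'y': [30, 40]})

# Cast the key to str on both sides so the merge compares like with like.
# assign() returns new frames, so left and right themselves are unchanged.
left_s = left.assign(columnB=left['columnB'].astype(str))
right_s = right.assign(columnB=right['columnB'].astype(str))

merged = pd.merge(left_s, right_s, how='inner', on=['columnA', 'columnB'])
print(merged)
```

Casting to str is the route the answer takes; converting the text column to a numeric dtype instead would work equally well when every value parses as a number.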
