Comparing two dataframes and finding what is missing from each dataframe in Python

Question

I have two dataframes with exactly the same structure. I need to compare them to find any records that differ because a column value is different.

I am using the code below, and it correctly reports whether the two dataframes tie out.

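import pandas as pd

# Stack both frames and renumber the rows so every index label is unique.
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
# Group on all columns: identical rows end up in the same group.
df_gpby = df.groupby(list(df.columns))
# A group of size 1 is a row that appears only once across both frames.
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
if df.reindex(idx).empty:
    print('everything is good.')
else:
    print('things do not tie out')
    df.reindex(idx).to_csv('diff.csv', index=False)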

diff.csv tells me which records are missing or different, but it does not tell me which dataframe each record came from, or which column values differ between the two dataframes for a given record. Is there a way to include this information in the final output?

Sample dataframes (df1 first, then df2):

   Name | Age| Gender
0| Naxi | 27 | Male
1| Karan| 25 | Male
2| Tanya| 27 | Female


   Name | Age| Gender
0| Naxi | 27 | Male
1| Tanya| 27 | Female
2| Karan| 24 | Male

Output I want:

   Name | Age| Gender | Dataframe
   Karan| 24 | Male   | df2
   Karan| 25 | Male   | df1

Can you add sample dataframes?

– Nk03, Commented Apr 19, 2021 at 14:59
@Nk03 I have added the dataframes.

– Naxi, Commented Apr 19, 2021 at 15:15

Nk03 · Accepted Answer · 2021-04-19 15:30:02Z

3

You can add one column to each dataframe that records where each row came from, then ignore that column when dropping duplicates (after pd.concat).

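import pandas as pd

# Tag every row with the dataframe it came from.
df1['Dataframe'] = 'df1'
df2['Dataframe'] = 'df2'
df = pd.concat([df1, df2])
# keep=False drops every row that has a match on the data columns.
diff_df = df.drop_duplicates(subset=['Name', 'Age', 'Gender'], keep=False)
print(diff_df)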

Output for the sample dataframes in the question:

    Name  Age Gender Dataframe
1  Karan   25   Male       df1
2  Karan   24   Male       df2

The index in the output helps you locate the corresponding row in the original dataframe.
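
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea, built on the sample data from the question. The helper name `tag_and_diff` and its parameters are hypothetical and chosen only for illustration. It differs from the snippet above in two small ways: it uses `assign` so the original dataframes are not modified, and it compares on all of the left frame's columns instead of a hard-coded list, which assumes both frames share the same columns.

```python
import pandas as pd


def tag_and_diff(left, right, left_label="df1", right_label="df2",
                 tag_col="Dataframe"):
    """Return rows found in only one frame, tagged with their origin.

    Hypothetical helper for illustration; assumes both frames have the
    same columns and no duplicate rows within a single frame.
    """
    # Compare on the data columns only; the tag column is excluded.
    compare_cols = list(left.columns)
    # assign() returns copies, so the caller's frames stay untouched.
    combined = pd.concat([
        left.assign(**{tag_col: left_label}),
        right.assign(**{tag_col: right_label}),
    ])
    # keep=False drops every row that has a match on compare_cols.
    return combined.drop_duplicates(subset=compare_cols, keep=False)


df1 = pd.DataFrame({
    "Name": ["Naxi", "Karan", "Tanya"],
    "Age": [27, 25, 27],
    "Gender": ["Male", "Male", "Female"],
})
df2 = pd.DataFrame({
    "Name": ["Naxi", "Tanya", "Karan"],
    "Age": [27, 27, 24],
    "Gender": ["Male", "Female", "Male"],
})

diff_df = tag_and_diff(df1, df2)
print(diff_df)

# The preserved index labels point back at rows in the source frames.
for label, source in (("df1", df1), ("df2", df2)):
    rows = diff_df.index[diff_df["Dataframe"] == label]
    print(label, source.loc[rows].to_dict("records"))
```

One caveat: because `keep=False` removes every duplicated row, a row that is repeated inside a single dataframe and absent from the other is also dropped, so this sketch is only reliable when each frame has no internal duplicates.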

