I want to compare two CSVs. They contain nearly identical data, but the second CSV has 2 identical rows that the first CSV does not have. I would like the program to output both of those rows, so I can see which row is present in CSV 2 but not in CSV 1, and how many times it appears.
Here is my current logic:
import pandas as pd

# Sample data standing in for the two CSVs
data1 = {"Col1": [0, 1, 1, 2],
         "Col2": [1, 2, 2, 3],
         "Col3": [5, 2, 1, 1],
         "Col4": [1, 2, 2, 3]}
data2 = {"Col1": [0, 1, 1, 2, 4, 4],
         "Col2": [1, 2, 2, 3, 4, 4],
         "Col3": [5, 2, 1, 1, 4, 4],
         "Col4": [1, 2, 2, 3, 4, 4]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Turn each row into a tuple and collect the rows of each frame in a set
ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)

# Rows that are in the second set but not in the first
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)
print(df)
Here is my current outcome:
   Col1  Col2  Col3  Col4
0     4     4     4     4
Here is my desired outcome:
   Col1  Col2  Col3  Col4
0     4     4     4     4
1     4     4     4     4
Right now it outputs the row only once, even though CSV 2 contains it twice. What can I do so that the output shows the missing row once for each time it appears in the second CSV? Thanks in advance!
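For reference, here is a minimal sketch of one approach that might produce the desired outcome. It assumes the extra rows do not appear in CSV 1 at all, and it swaps the set difference for a left merge using pandas' indicator option, because a set keeps only one copy of each tuple. The names only_in_df2 and counts are illustrative, not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 1, 1, 2],
                    "Col2": [1, 2, 2, 3],
                    "Col3": [5, 2, 1, 1],
                    "Col4": [1, 2, 2, 3]})
df2 = pd.DataFrame({"Col1": [0, 1, 1, 2, 4, 4],
                    "Col2": [1, 2, 2, 3, 4, 4],
                    "Col3": [5, 2, 1, 1, 4, 4],
                    "Col4": [1, 2, 2, 3, 4, 4]})

# Left-merge df2 against the unique rows of df1 on all shared columns.
# Deduplicating df1 first means each df2 row matches at most one df1 row,
# so the merge keeps exactly one output row per df2 row.
# indicator=True adds a "_merge" column saying where each row was found.
merged = df2.merge(df1.drop_duplicates(), how="left", indicator=True)

# Keep the rows found only in df2, duplicates included.
# Assumption: a row counts as "missing" only if df1 has no copy of it at all.
only_in_df2 = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(only_in_df2)

# Count how many times each missing row occurs in df2.
counts = (
    only_in_df2.groupby(list(only_in_df2.columns))
    .size()
    .reset_index(name="count")
)
print(counts)
```

If a row can appear in both CSVs but more often in CSV 2, this sketch will not report the surplus copies. That case would need a per-row count comparison instead.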