Finding diffs between two CSV files with primary column

Question

I have two CSV files:

File 1

Id, 1st, 2nd
1, first, row
2, second, row

File 2

Id, 1st, 2nd
1, first, row
2, second, line
3, third, row

I am just starting in python and need to write some code, which can do the diff on these files based on primary columns and in this case first column "Id". Output file should be a delta file which should identify the rows that have changed in the second file:

Output delta file

2, second, line
3, third, row

Enrico Gandini · Accepted Answer · 2021-02-12 11:28:53Z

2

I suggest you load both CSV files as Pandas DataFrames, and then you use and outer merge with indicator to know what rows changed in the second file. Then, you use query to get only the rows that changed in the second file, and you drop the indicator column ('_merge').

import pandas as pd


df1 = pd.read_csv("FILENAME_1.csv")
df2 = pd.read_csv("FILENAME_2.csv")

merged = pd.merge(df1, df2, how="outer", indicator=True)
diff = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")

For further details on finding differences in Pandas DataFrames, read this other question.

answered Feb 12, 2021 at 11:28

Enrico Gandini

1,0451 gold badge11 silver badges30 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

tanatan Over a year ago

Thank you Enrico. I will use this approach as it seems to provide more options on merging.

foglerit · Accepted Answer · 2021-02-12 12:18:23Z

1

I'd also use pandas, as Enrico suggested, for anything more complex than your example. But if you want to do it in pure Python, you can convert your rows into sets and compute a set difference:

import csv
from io import StringIO

data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""

s1 = {tuple(row) for row in csv.reader(StringIO(data1))}
s2 = {tuple(row) for row in csv.reader(StringIO(data2))}
print(s2-s1)

print(s2-s1)
{('2', ' second', ' line'), ('3', ' third', ' row')}

Note that in your example you are not actually diffing based on your primary column only, but on the entire row. If you really want to only consider the Id column, you can do:

d1 = {row[0]:row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]:row[1:] for row in csv.reader(StringIO(data2))}
diff = { k : d2[k] for k in set(d2) - set(d1)}
print(diff)
{'3': [' third', ' row']}

answered Feb 12, 2021 at 12:18

foglerit

8,4469 gold badges53 silver badges70 bronze badges

1 Comment

tanatan Over a year ago

Thank you forlerit. New to Python and useful to know different ways of doing things

Collectives™ on Stack Overflow

Finding diffs between two CSV files with primary column

2 Answers 2

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related