1

I have two CSV files:

File 1

Id, 1st, 2nd
1, first, row
2, second, row

File 2

Id, 1st, 2nd
1, first, row
2, second, line
3, third, row

I am just starting in python and need to write some code, which can do the diff on these files based on primary columns and in this case first column "Id". Output file should be a delta file which should identify the rows that have changed in the second file:

Output delta file

2, second, line
3, third, row

2 Answers 2

2

I suggest you load both CSV files as Pandas DataFrames, and then you use and outer merge with indicator to know what rows changed in the second file. Then, you use query to get only the rows that changed in the second file, and you drop the indicator column ('_merge').

import pandas as pd


df1 = pd.read_csv("FILENAME_1.csv")
df2 = pd.read_csv("FILENAME_2.csv")

merged = pd.merge(df1, df2, how="outer", indicator=True)
diff = merged.query("_merge == 'right_only'").drop("_merge", axis="columns")

For further details on finding differences in Pandas DataFrames, read this other question.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you Enrico. I will use this approach as it seems to provide more options on merging.
1

I'd also use pandas, as Enrico suggested, for anything more complex than your example. But if you want to do it in pure Python, you can convert your rows into sets and compute a set difference:

import csv
from io import StringIO

data1 = """Id, 1st, 2nd
1, first, row
2, second, row"""
data2 = """Id, 1st, 2nd
1, first, row
2, second, line
3, third, row"""

s1 = {tuple(row) for row in csv.reader(StringIO(data1))}
s2 = {tuple(row) for row in csv.reader(StringIO(data2))}
print(s2-s1)

print(s2-s1)
{('2', ' second', ' line'), ('3', ' third', ' row')}

Note that in your example you are not actually diffing based on your primary column only, but on the entire row. If you really want to only consider the Id column, you can do:

d1 = {row[0]:row[1:] for row in csv.reader(StringIO(data1))}
d2 = {row[0]:row[1:] for row in csv.reader(StringIO(data2))}
diff = { k : d2[k] for k in set(d2) - set(d1)}
print(diff)
{'3': [' third', ' row']}

1 Comment

Thank you forlerit. New to Python and useful to know different ways of doing things

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.