Compare two csv files in python and retain headers of changes

Question

I'm trying to compare two CSV files in Python and output the differences along with the headers of each column. So far, my code outputs all columns instead of just the ones with differences:

import csv

with open('firstfile.csv', 'r') as f1:
    file1 = f1.readlines()

with open('secondfile.csv', 'r') as f2:
    file2 = f2.readlines()

with open('results.csv', 'w') as outFile:
    outFile.write(file1[0])
    for line in file2:
        if line not in file1:
            outFile.write(line)

I just tested this code and it works fine for me, so the problem lies elsewhere.
– mVChr, Commented Oct 31, 2018 at 20:42
The issue is that the output file prints all headers instead of only the ones that changed. I'm trying to print only the headers that changed. This might be because it's evaluating lines, and therefore rows instead of columns (I'm not sure, though).
– user6302747, Commented Oct 31, 2018 at 20:50
Oh, yeah, this only outputs the rows that changed. You'd have to add a lot more code to do columns.
– mVChr, Commented Oct 31, 2018 at 20:55
This isn't super clear. Are the tables the same length? Do they have the same number of columns? Are the table headers identical? Which differences are you looking for?
– vencaslac, Commented Oct 31, 2018 at 22:02
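
As the comments point out, the question's code compares whole lines, so it can only report changed rows. A minimal sketch of a cell-by-cell comparison follows. It assumes both files share the same header row and that rows line up by position; the function name `changed_headers` and the comma delimiter are illustrative choices, not something from the question.

```python
import csv
from itertools import zip_longest


def changed_headers(lines1, lines2, delimiter=","):
    """Return the headers of columns whose cells differ between two CSVs.

    Assumption: both inputs have the same header row, and row N of the
    first input corresponds to row N of the second.
    """
    reader1 = csv.reader(lines1, delimiter=delimiter)
    reader2 = csv.reader(lines2, delimiter=delimiter)
    headers = next(reader1)
    next(reader2)  # skip the second header row (assumed identical)

    changed = []
    # zip_longest pads the shorter file with empty rows
    for row1, row2 in zip_longest(reader1, reader2, fillvalue=[]):
        for i, header in enumerate(headers):
            # Treat a missing cell as an empty string
            cell1 = row1[i] if i < len(row1) else ""
            cell2 = row2[i] if i < len(row2) else ""
            if cell1 != cell2 and header not in changed:
                changed.append(header)
    return changed
```

The function accepts any iterable of lines, so an open file object works, for example `changed_headers(open('firstfile.csv', newline=''), open('secondfile.csv', newline=''))`.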

Ruslan Galimov · Accepted Answer · 2018-10-31 21:15:03Z

1

I think this code resolves your problem:

import sys

with open('file1.csv', 'r') as f1:
    file1 = f1.readlines()

with open('file2.csv', 'r') as f2:
    file2 = f2.readlines()

delimiter = '\t'  # Column delimiter in your file
headers_of_first_file = file1[0].strip().split(delimiter)
headers_of_second_file = file2[0].strip().split(delimiter)

# You can remove this check if you want to work with files that have different columns, but then you have to add more code in the next blocks
different_headers = set(headers_of_first_file).symmetric_difference(headers_of_second_file)
if different_headers:
    print('Files have difference in headers: ', different_headers)
    sys.exit(-1)

# Build map {header: [all_values]}
first_file_map = {header: [] for header in headers_of_first_file}
for row in file1[1:]:
    for index, cell in enumerate(row.strip().split(delimiter)):
        first_file_map[headers_of_first_file[index]].append(cell)

# Check against the built map. Don't forget that columns may change order
result = set()
for row in file2[1:]:
    for index, cell in enumerate(row.strip().split(delimiter)):
        if cell not in first_file_map[headers_of_second_file[index]]:
            result.add(headers_of_second_file[index])

with open('results.csv', 'w') as out_file:
    # Sort so the output order is deterministic (set order is arbitrary)
    out_file.write(delimiter.join(sorted(result)))

Update: example files:

Column1 Column2 Column3 Column5 Column4
1   2   3   5   4
10  20  30  50  40

Column1 Column2 Column3 Column4 Column5
11  2   3   4   5
10  10  30  40  50

'\t' is the delimiter.
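
For reference, here is a rough variant of the same idea built on `csv.DictReader` and sets. This is a sketch, not the code above: the function name `columns_with_new_values` is made up for illustration, and it assumes the first line of each file is the header. Looking cells up by header name copes with reordered columns, sets keep the membership test cheap on large files, and rows with extra or missing cells are tolerated rather than raising an `IndexError`.

```python
import csv
from collections import defaultdict


def columns_with_new_values(lines1, lines2, delimiter="\t"):
    """Return sorted headers of columns in the second CSV that contain
    a value not seen anywhere in the same-named column of the first CSV."""
    seen = defaultdict(set)
    for row in csv.DictReader(lines1, delimiter=delimiter):
        for header, cell in row.items():
            # DictReader stores surplus cells under the key None; skip them
            if header is not None:
                seen[header].add(cell)

    changed = set()
    for row in csv.DictReader(lines2, delimiter=delimiter):
        for header, cell in row.items():
            # A header missing from the first file has an empty set,
            # so every value in that column counts as new
            if header is not None and cell not in seen[header]:
                changed.add(header)
    return sorted(changed)


file1 = [
    "Column1\tColumn2\tColumn3\tColumn5\tColumn4\n",
    "1\t2\t3\t5\t4\n",
    "10\t20\t30\t50\t40\n",
]
file2 = [
    "Column1\tColumn2\tColumn3\tColumn4\tColumn5\n",
    "11\t2\t3\t4\t5\n",
    "10\t10\t30\t40\t50\n",
]
print(columns_with_new_values(file1, file2))  # ['Column1', 'Column2']
```

On the example files above, only `Column1` (value 11) and `Column2` (value 10) contain values that do not appear in the first file.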

edited Oct 31, 2018 at 21:15

answered Oct 31, 2018 at 21:10

6 Comments

user6302747 Over a year ago

Tried it out; it raises the following error: first_file_map[headers_of_first_file[index]].append(cell) IndexError: list index out of range. Could this possibly be because there are too many columns?

Ruslan Galimov Over a year ago

Can you give an example of the files? If a row has a different number of values than there are columns, this error can happen.

user6302747 Over a year ago

Your solution worked. The files have 100k+ rows and some data fields were missing. I used pandas to normalize the data set, then ran the comparison again and it worked fine!

Ruslan Galimov Over a year ago

Glad I could help.

user6302747 Over a year ago

Quick update to this: how would I handle the case where the rows and columns are not identical?
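
One possible way to approach that follow-up, sketched under a strong assumption: both files contain a column that uniquely identifies each row (called `id` in the usage note, which is a hypothetical name). Rows are matched by that key instead of by position, only columns present in both files are compared, and rows or columns that exist in just one file are reported separately. The names `load_rows` and `diff_by_key` are illustrative.

```python
import csv


def load_rows(lines, key, delimiter=","):
    """Read a CSV into (headers, {key value: row dict})."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    rows = {row[key]: row for row in reader}
    return list(reader.fieldnames or []), rows


def diff_by_key(lines1, lines2, key, delimiter=","):
    """Compare two CSVs whose rows and columns may not line up.

    Assumption: `key` names a column present in both files whose
    values identify a row uniquely.
    """
    headers1, rows1 = load_rows(lines1, key, delimiter)
    headers2, rows2 = load_rows(lines2, key, delimiter)

    shared = [h for h in headers1 if h in headers2]
    changed = set()
    # Only rows present in both files can be compared cell by cell
    for k in rows1.keys() & rows2.keys():
        for header in shared:
            if rows1[k][header] != rows2[k][header]:
                changed.add(header)

    return {
        "columns_only_in_first": [h for h in headers1 if h not in headers2],
        "columns_only_in_second": [h for h in headers2 if h not in headers1],
        "rows_only_in_first": sorted(rows1.keys() - rows2.keys()),
        "rows_only_in_second": sorted(rows2.keys() - rows1.keys()),
        "changed_columns": sorted(changed),
    }
```

A call such as `diff_by_key(open('file1.csv', newline=''), open('file2.csv', newline=''), key='id')` returns a dictionary with the five lists. If no such key column exists, rows cannot be matched reliably and a different strategy is needed.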


cottontail · Accepted Answer · 2022-08-02 16:31:08Z

import csv

def compareList(l1, l2):
    # Lists are equal when they have the same length and matching elements
    return "Equal" if l1 == l2 else "Non equal"

file1 = "C:/Users/Sarvesh/Downloads/a.csv"
file2 = "C:/Users/Sarvesh/Downloads/b.csv"

with open(file1, 'r') as csv1, open(file2, 'r') as csv2:  # Import CSV files
    import1 = csv1.readlines()
    import2 = csv2.readlines()

    # creating a csv reader object
    # with '|' as the delimiter
    csv_reader = csv.reader(import1, delimiter='|')
    # list to store the names of columns
    list_of_column_name1 = []
    # loop to iterate through the rows of csv
    for row in csv_reader:
        # adding the first row
        list_of_column_name1.append(row)
        # breaking the loop after the
        # first iteration itself
        break

    csv_reader = csv.reader(import2, delimiter='|')
    # list to store the names of columns
    list_of_column_name2 = []
    # loop to iterate through the rows of csv
    for row in csv_reader:
        # adding the first row
        list_of_column_name2.append(row)
        # breaking the loop after the
        # first iteration itself
        break

# printing the result
print("1List of column names : ", list_of_column_name1[0])

print("2List of column names : ", list_of_column_name2[0])

print("First comparison",compareList(list_of_column_name1,list_of_column_name2))
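
The code above only reports whether the two header rows are equal. A shorter sketch that also lists which column names differ is shown below. It assumes the first line of each file is the header and keeps the `|` delimiter from the code above; `read_header` and `compare_headers` are illustrative names.

```python
import csv


def read_header(lines, delimiter="|"):
    """Return the first row of a CSV as a list, or [] if the input is empty."""
    return next(csv.reader(lines, delimiter=delimiter), [])


def compare_headers(lines1, lines2, delimiter="|"):
    """Report whether two header rows match and which names are unique to each."""
    header1 = read_header(lines1, delimiter)
    header2 = read_header(lines2, delimiter)
    return {
        "equal": header1 == header2,  # same names in the same order
        "only_in_first": [h for h in header1 if h not in header2],
        "only_in_second": [h for h in header2 if h not in header1],
    }
```

`next(csv.reader(...), [])` reads just the first row, which replaces the loop-and-break pattern.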
