Python : Compare two csv files and print out differences

Question

I need to compare two CSV files and write the differences to a third CSV file. In my case, the first CSV, old.csv, is an old list of hashes, and the second CSV, new.csv, is the new list, which contains both the old and the new hashes.

Here is my code :

import csv
t1 = open('old.csv', 'r')
t2 = open('new.csv', 'r')
fileone = t1.readlines()
filetwo = t2.readlines()
t1.close()
t2.close()

outFile = open('update.csv', 'w')
x = 0
for i in fileone:
    if i != filetwo[x]:
        outFile.write(filetwo[x])
    x += 1
outFile.close()

The third file is a copy of the old one and not the update. What's wrong? I hope you can help me, many thanks!

PS: I don't want to use diff.

Not an answer, but a comment: under Linux, you can simply run diff file1 file2 on the command line.
– Jan, Commented Aug 17, 2016 at 12:01
Look at difflib, see: stackoverflow.com/questions/19120489/…
– Chris_Rands, Commented Aug 17, 2016 at 12:03
You need to be more precise as to what a "difference" is and how to print it. What if a line is in the old file but not in the new one? What if a line is in the new file but not in the old one? What if two consecutive lines are swapped? What if a line is moved to another position? Details like these make it hard to compare DNA sequences, for example, but you need to be sure exactly what you mean in your problem.
– Rory Daulton, Commented Aug 17, 2016 at 12:10
@Chris_Rands because I need to use CSV again for other things like SQL insert etc.
– Nick Yellow, Commented Aug 17, 2016 at 12:11

Chris Mueller · Accepted Answer · 2016-08-17 12:14:29Z

64

The problem is that you are comparing each line in fileone to the same line in filetwo. As soon as there is an extra line in one file you will find that the lines are never equal again. Try this:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
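
For larger files, the `line not in fileone` test scans the whole list once for every line of the new file. A possible variation (a sketch, not part of the original answer) keeps the old lines in a set so that each lookup is a hash lookup. The helper names `added_lines` and `write_update` are made up for illustration.

```python
def added_lines(old_lines, new_lines):
    """Return the lines of new_lines that do not occur in old_lines, in order.

    Hypothetical helper, not part of the original answer.
    """
    seen = set(old_lines)  # hash-based membership test instead of a list scan
    return [line for line in new_lines if line not in seen]


def write_update(old_path='old.csv', new_path='new.csv', out_path='update.csv'):
    """Write the lines found only in the new file to out_path."""
    with open(old_path, 'r') as t1, open(new_path, 'r') as t2:
        old_lines = t1.readlines()
        new_lines = t2.readlines()
    with open(out_path, 'w') as out_file:
        out_file.writelines(added_lines(old_lines, new_lines))
```

Like the list version, this compares raw lines, so a final line without a trailing newline will not match the same line followed by one.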

answered Aug 17, 2016 at 12:14


1 Comment

Chris Mueller Over a year ago

@NickYellow No problem. FYI, it is generally best practice to use the with open() as statement to open files so that they are closed properly if any errors occur.

Milton · Accepted Answer · 2020-02-04 00:19:53Z

26

You may find this package useful (csv-diff):

pip install csv-diff

Once installed, you can run it from the command line:

csv-diff one.csv two.csv --key=id

answered Feb 4, 2020 at 0:19


4 Comments

Tyler Dane Over a year ago

awesome library - easy to use and data is nicely outputted

Snehaa Ganesan Over a year ago

How may I import it? For use in a Jupyter notebook.

Vishnoo Rath Over a year ago

@SnehaaGanesan github.com/simonw/csv-diff has example of using as a python library

Binita Bharati Over a year ago

I was getting the following error on running the csv-diff command:

Click discovered that you exported a UTF-8 locale but the locale system could not pick up from it because it does not exist. The exported locale is 'en_US.UTF-8' but it is not supported.

To counter this specific error, check the output of locale. LC_ALL may not have been set in the system locale. You can export the env variable LC_ALL according to your desired config, e.g. export LC_ALL=en_US.utf-8. After this export command, my csv-diff command started working.

seler · Accepted Answer · 2018-03-23 08:31:52Z

It feels natural to detect differences using sets.

#!/usr/bin/env python3

import sys
import argparse
import csv


def get_dataset(f):
    return set(map(tuple, csv.reader(f)))


def main(f1, f2, outfile, sorting_column):
    set1 = get_dataset(f1)
    set2 = get_dataset(f2)
    different = set1 ^ set2

    output = csv.writer(outfile)

    for row in sorted(different, key=lambda x: x[sorting_column], reverse=True):
        output.writerow(row)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('infile', nargs=2, type=argparse.FileType('r'))
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout)
    parser.add_argument('-sc', '--sorting-column', nargs='?', type=int, default=0)

    args = parser.parse_args()

    main(*args.infile, args.outfile, args.sorting_column)
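
Note that `set1 ^ set2` is the symmetric difference: it holds the rows that appear in exactly one of the two files, so rows removed from the old file are reported along with rows added in the new one. If only the additions are wanted, as in the question, `set2 - set1` is one option. Sets also discard duplicate rows and the original row order. A small sketch with in-memory data (the sample hashes are made up):

```python
import csv
import io

# Made-up sample data standing in for old.csv and new.csv
old_data = io.StringIO("aaa,1\nbbb,2\n")
new_data = io.StringIO("aaa,1\nccc,3\n")

old_rows = set(map(tuple, csv.reader(old_data)))
new_rows = set(map(tuple, csv.reader(new_data)))

only_in_one = old_rows ^ new_rows  # rows present in exactly one file
added = new_rows - old_rows        # rows present only in the new file

print(sorted(only_in_one))  # [('bbb', '2'), ('ccc', '3')]
print(sorted(added))        # [('ccc', '3')]
```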

Graham · Accepted Answer · 2017-09-27 04:36:21Z

I assumed your new file was just like your old one, except that some lines were added in between the old ones. The old lines in both files are stored in the same order.

Try this :

with open('old.csv', 'r') as t1:
    old_csv = t1.readlines()
with open('new.csv', 'r') as t2:
    new_csv = t2.readlines()

with open('update.csv', 'w') as out_file:
    line_in_new = 0
    line_in_old = 0
    while line_in_new < len(new_csv):
        # Once the old file is exhausted, every remaining new line is an addition
        if line_in_old >= len(old_csv) or old_csv[line_in_old] != new_csv[line_in_new]:
            out_file.write(new_csv[line_in_new])
        else:
            line_in_old += 1
        line_in_new += 1

Note that I used the context manager with and some meaningful variable names, which makes the code easier to understand. You also don't need the csv package, since you're not using any of its functionality here.
About your code: you were almost doing the right thing, except that you must not go to the next line in your old CSV unless you are reading the same thing in both CSVs. That is to say, if you find a new line, keep reading the new file until you stumble upon an old one, and then you'll be able to continue reading.

UPDATE: This solution is not as pretty as Chris Mueller's, which is perfect and very Pythonic for small files, but it makes a single pass over both files instead of searching the whole old file for every new line (keeping the idea of your original algorithm), so it can be better if you have larger files.
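
If memory is the concern, the same idea can be applied without `readlines()`, by iterating over the file objects directly. This is a minimal sketch under the same assumption as this answer, namely that the old lines appear in the new file in the same order. The function name `stream_new_lines` is hypothetical.

```python
def stream_new_lines(old_lines, new_lines):
    """Yield the lines of new_lines that are not matched, in order, by old_lines.

    Hypothetical helper. Both arguments can be any iterables of strings,
    including open file objects, so neither file has to be loaded fully.
    """
    old_iter = iter(old_lines)
    pending_old = next(old_iter, None)  # None once the old input is exhausted
    for line in new_lines:
        if pending_old is not None and line == pending_old:
            # Same line in both inputs: advance in the old one
            pending_old = next(old_iter, None)
        else:
            # Line exists only in the new input
            yield line
```

With the three files opened in one `with` statement as `t1`, `t2` and `out_file`, it could be used as `out_file.writelines(stream_new_lines(t1, t2))`.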

JL0PD · Accepted Answer · 2021-02-24 19:48:06Z

import pandas as pd
import sys
import csv

def dataframe_difference(df1: pd.DataFrame, df2: pd.DataFrame, csvfile, which=None):
    """Find rows which are different between two DataFrames."""
    comparison_df = df1.merge(
        df2,
        indicator=True,
        how='outer'
    )
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]
    diff_df.to_csv(csvfile)
    return diff_df


if __name__ == '__main__':
    df1 = pd.read_csv(sys.argv[1], sep=',')
    df2 = pd.read_csv(sys.argv[2], sep=',')

    # sort_values returns a new DataFrame, so assign the result back
    df1 = df1.sort_values(sys.argv[3])
    df2 = df2.sort_values(sys.argv[3])
    #df1.drop(df1.columns[list(map(int, sys.argv[4].split()))], axis = 1, inplace = True)
    #df2.drop(df2.columns[list(map(int, sys.argv[4].split()))], axis = 1, inplace = True)

    # The output file is always the last argument, with or without the drop list
    print(dataframe_difference(df1, df2, sys.argv[-1]))

To use it, run:

python3 script.py file1.csv file2.csv some_common_header_to_sort_each_file output_file.csv

If you want to drop any columns from the comparison, uncomment the df.drop lines and run:

python3 script.py file1.csv file2.csv some_common_header_to_sort_each_file "x y z..." output_file.csv

where x,y,z are the column numbers to drop, index starts from 0.

rahul-ahuja · Accepted Answer · 2022-03-07 04:29:12Z

2

Thanks to @vishnoo-rath's comment under one of the above answers, for providing a link to the following page : https://github.com/simonw/csv-diff#as-a-python-library

from csv_diff import load_csv, compare
diff = compare(
    load_csv(open("one.csv"), key="id"),
    load_csv(open("two.csv"), key="id")
)
print(diff)
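
For readers who prefer not to add a dependency, the idea of a keyed comparison can be approximated with the standard library. The following is only a sketch of the concept, not how csv-diff is implemented. The function name `keyed_diff` and the layout of the returned dictionary are assumptions made for illustration, and it expects the key column (here `id`) to be unique in each file.

```python
import csv
import io


def keyed_diff(old_file, new_file, key='id'):
    """Compare two CSV files by a key column (hypothetical helper).

    Returns a dict with the added rows, the removed rows, and
    (old, new) pairs for rows whose key exists in both files
    but whose other fields differ.
    """
    old_rows = {row[key]: row for row in csv.DictReader(old_file)}
    new_rows = {row[key]: row for row in csv.DictReader(new_file)}
    return {
        'added': [new_rows[k] for k in new_rows if k not in old_rows],
        'removed': [old_rows[k] for k in old_rows if k not in new_rows],
        'changed': [
            (old_rows[k], new_rows[k])
            for k in old_rows
            if k in new_rows and old_rows[k] != new_rows[k]
        ],
    }


# Made-up sample data standing in for one.csv and two.csv
old_csv = io.StringIO("id,hash\n1,aaa\n2,bbb\n")
new_csv = io.StringIO("id,hash\n1,aaa\n2,bbx\n3,ccc\n")
result = keyed_diff(old_csv, new_csv)
print(result['added'])
print(result['removed'])
print(result['changed'])
```

Here row 3 is reported as added, nothing as removed, and row 2 as changed.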

edited Mar 7, 2022 at 4:29

answered Feb 23, 2022 at 12:50


Aaksh Kumar · Accepted Answer · 2019-03-28 12:30:59Z

1

with open('first_test_pipe.csv', 'r') as t1, open('validation.csv', 'r') as t2:
    filecoming = t1.readlines()
    filevalidation = t2.readlines()

# Compare the files line by line; zip stops at the end of the shorter file
for coming_line, validation_line in zip(filecoming, filevalidation):
    coming_set = set(coming_line.rstrip("\n").split(","))
    validation_set = set(validation_line.rstrip("\n").split(","))
    # Fields present on both lines
    ReceivedDataList = list(validation_set & coming_set)
    # Fields present on only one of the two lines (symmetric difference)
    NotReceivedDataList = list(coming_set ^ validation_set)
    print(NotReceivedDataList)

answered Mar 28, 2019 at 12:30


Standin.Wolf · Accepted Answer · 2023-03-29 12:12:44Z

0

import csv

with open('fileone.csv', 'r') as r1, open('filetwo.csv', 'r') as r2, open('filethree.csv', 'w', newline='') as r3:
    old_csv = r1.readlines()
    new_csv = r2.readlines()

    # For example, if you want to add headers to the third file, do it as below.
    # Everything that writes to r3 must stay inside the with block, while the file is still open.
    fieldnames = ['Configuration','Host','IPAddress','Account','Profile','Version','Type','OS']
    compare_csv = csv.DictWriter(r3, fieldnames=fieldnames)
    compare_csv.writeheader()
    # Write every line of the old file that does not appear in the new file
    for row2 in old_csv:
        if row2 not in new_csv:
            r3.write(row2)

edited Mar 29, 2023 at 12:12 by Standin.Wolf

answered Dec 7, 2022 at 22:24 by Vijay

Edgecase · Accepted Answer · 2024-12-27 15:25:51Z

import csv

def compare_csv_files(file1, file2, sort_keys):
    # Helper function to read and sort the files
    def read_and_sort(file_path, sort_keys):
        with open(file_path, 'r') as f:
            reader = list(csv.reader(f))
            
            # Create a sort key function
            def sort_key(row):
                return tuple(row[key] for key in sort_keys if key < len(row))
            
            # Sort data based on the provided keys
            reader.sort(key=sort_key)
            return reader
    
    # Read and sort both files
    sorted_data1 = read_and_sort(file1, sort_keys)
    sorted_data2 = read_and_sort(file2, sort_keys)
    
    # Iterate over as many rows as the longer file has
    max_rows = max(len(sorted_data1), len(sorted_data2))
    differences = []

    for i in range(max_rows):
        # Handle cases where one file has more rows than the other
        row1 = sorted_data1[i] if i < len(sorted_data1) else ["<missing row>"]
        row2 = sorted_data2[i] if i < len(sorted_data2) else ["<missing row>"]
        
        # Compare individual elements in the rows
        max_cols = max(len(row1), len(row2))
        for j in range(max_cols):
            # Handle cases where one row has more columns than the other
            elem1 = row1[j] if j < len(row1) else "<missing element>"
            elem2 = row2[j] if j < len(row2) else "<missing element>"
            
            if elem1 != elem2:
                differences.append({
                    "Row": i + 1,
                    "Column": j + 1,
                    "File1": elem1,
                    "File2": elem2
                })
    
    # Output the differences
    if differences:
        print("Differences found:")
        for diff in differences:
            print(f"Row {diff['Row']}, Column {diff['Column']}:")
            print(f"  File1: {diff['File1']}")
            print(f"  File2: {diff['File2']}")
    else:
        print("No differences found.")

# Example usage
# Specify the key columns for sorting (0-based index, e.g., [0, 1] for the first and second columns)
compare_csv_files("file1.txt", "file2.txt", sort_keys=[0, 1])
