Comparing two csv files and getting difference

Question

I have two CSV files that I need to compare, and then spit out the differences:

CSV FORMAT:

 Name   Produce   Number
 Adam   Apple     5
 Tom    Orange    4
 Adam   Orange    11

I need to compare the two CSV files and report whether there is a difference between Adam's apples on sheet 1 and sheet 2, and do that for all names and produce numbers. Both CSV files will be formatted the same.

Any pointers will be greatly appreciated

You've tagged this with excel but mention CSV files. Do you need to work with xlsx or xls files? You might find that diff works for what you need, but you haven't really said whether this needs to be done a lot and built into an existing Python program.
– ChrisP, Commented Jun 19, 2012 at 20:19

Aakash Gupta · Accepted Answer · 2016-07-23 12:48:00Z

8

I have used csvdiff

$ pip install csvdiff
$ csvdiff --style=compact col1 a.csv b.csv

Link to package on pypi

I found this link useful

answered Jul 23, 2016 at 12:48

Aakash Gupta


octopusgrabbus · Accepted Answer · 2012-06-19 21:05:49Z

5

If your CSV files are small enough to load into memory without bringing your machine to its knees, you could try something like:

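import csv
csv1 = list(csv.DictReader(open('file1.csv', newline='')))
csv2 = list(csv.DictReader(open('file2.csv', newline='')))
# dicts are not hashable, so turn each row into a tuple of (column, value) pairs
set1 = set(tuple(row.items()) for row in csv1)
set2 = set(tuple(row.items()) for row in csv2)
print(set1 - set2)  # in 1, not in 2
print(set2 - set1)  # in 2, not in 1
print(set1 & set2)  # in both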

For large files, you could load them into a SQLite3 database and use SQL queries to do the same, or sort by relevant keys and then do a match-merge.
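The SQLite route mentioned above could look roughly like the sketch below. It assumes comma-separated files with a Name, Produce, Number header row; the table names sheet1 and sheet2 and the helper names load_table and diff_tables are made up for illustration.

```python
import csv
import sqlite3

def load_table(conn, table, path):
    """Load a Name/Produce/Number CSV (with a header row) into a new table.

    `table` is interpolated into the SQL, so it must be a trusted identifier.
    """
    conn.execute(f"CREATE TABLE {table} (name TEXT, produce TEXT, number INTEGER)")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", reader)

def diff_tables(conn):
    """Return (rows only in sheet1, rows only in sheet2)."""
    only_1 = conn.execute(
        "SELECT * FROM sheet1 EXCEPT SELECT * FROM sheet2").fetchall()
    only_2 = conn.execute(
        "SELECT * FROM sheet2 EXCEPT SELECT * FROM sheet1").fetchall()
    return only_1, only_2
```

To try it, open a connection with sqlite3.connect(":memory:") (or pass a file name so the data lives on disk rather than in memory), call load_table once per file, then call diff_tables. A row whose number changed shows up in both results, once with each value.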

edited Jun 19, 2012 at 21:05

octopusgrabbus


answered Jun 19, 2012 at 20:40

Jon Clements


3 Comments

Henrik K Over a year ago

The dicts in the csv1 list are not hashable so creating set1 will not be possible. This can be avoided by conversion of the dicts to strings with json.dumps

Jon Clements Over a year ago

@HK_CK okay I am happy for you to add that to answer... just not change it as you suggested...

3pitt Over a year ago

TypeError: unhashable type: 'dict'. Come on!

Community · Accepted Answer · 2017-05-23 12:18:10Z

1

One of the best utilities for comparing two different files is diff.

See Python implementation here: Comparing two .txt files using difflib in Python
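A minimal difflib sketch for two CSV files might look like this. The helper name unified_csv_diff is hypothetical. Note that this is a line-by-line text comparison, so it is sensitive to row order and knows nothing about columns.

```python
import difflib

def unified_csv_diff(path1, path2):
    """Return unified-diff lines describing how path2 differs from path1."""
    with open(path1) as f1, open(path2) as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()
    # Lines starting with '-' appear only in path1, '+' only in path2
    return list(difflib.unified_diff(lines1, lines2,
                                     fromfile=path1, tofile=path2))
```

Printing "".join(unified_csv_diff('file1.csv', 'file2.csv')) gives output in the same style as the command-line diff -u.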

edited May 23, 2017 at 12:18

CommunityBot


answered Jun 19, 2012 at 20:21

SomeKittens


Hugh Bothwell · Accepted Answer · 2012-06-19 21:02:46Z

1

import csv

def load_csv_to_dict(fname, get_key, get_data):
    # newline='' lets the csv module handle line endings itself
    with open(fname, newline='') as inf:
        incsv = csv.reader(inf)
        next(incsv)  # skip header
        return {get_key(row): get_data(row) for row in incsv}

def main():
    key = lambda r: tuple(r[0:2])  # (Name, Produce)
    data = lambda r: int(r[2])     # Number
    f1 = load_csv_to_dict('file1.csv', key, data)
    f2 = load_csv_to_dict('file2.csv', key, data)

    f1keys = set(f1)
    f2keys = set(f2)

    print("Keys in file1 but not file2:")
    print(", ".join(str(a)+":"+str(b) for a,b in (f1keys-f2keys)))

    print("Keys in file2 but not file1:")
    print(", ".join(str(a)+":"+str(b) for a,b in (f2keys-f1keys)))

    print("Differing values:")
    for k in (f1keys & f2keys):
        a,b = f1[k], f2[k]
        if a != b:
            print("{}:{} {} <> {}".format(k[0],k[1], a, b))

if __name__=="__main__":
    main()

edited Jun 19, 2012 at 21:02

answered Jun 19, 2012 at 20:37

Hugh Bothwell


1 Comment

Jeremy Pridemore Over a year ago

When I try this in IDLE I get this: pastebin.com/6U035ERr (Using pastebin so you can see the whole error message with formatting)

octopusgrabbus · Accepted Answer · 2012-06-20 13:00:33Z

1

If you want to use Python's csv module along with a generator function, you can read large .csv files lazily and compare them row by row. The example below compares the rows at the same position in each file using a cursory comparison:

import csv

def csv_lazy_get(csvfile):
    # newline='' lets the csv module handle line endings itself
    with open(csvfile, newline='') as f:
        r = csv.reader(f)
        for row in r:
            yield row

def csv_cmp_lazy(csvfile1, csvfile2):
    gen_2 = csv_lazy_get(csvfile2)

    for row_1 in csv_lazy_get(csvfile1):
        # None marks a row that is missing from the second file
        row_2 = next(gen_2, None)

        print("row_1: ", row_1)
        print("row_2: ", row_2)

        if row_2 == row_1:
            print("row_1 is equal to row_2.")
        else:
            print("row_1 is not equal to row_2.")

    gen_2.close()
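If you would rather collect the differences than print every row, one possible variation pairs the two readers with itertools.zip_longest, so rows that exist in only one file are reported too. The function name differing_rows is made up for this sketch.

```python
import csv
from itertools import zip_longest

def differing_rows(csvfile1, csvfile2):
    """Yield (row_number, row_1, row_2) for each position where the files differ.

    A row missing from the shorter file is reported as None.
    """
    with open(csvfile1, newline="") as f1, open(csvfile2, newline="") as f2:
        pairs = zip_longest(csv.reader(f1), csv.reader(f2))
        for number, (row_1, row_2) in enumerate(pairs, start=1):
            if row_1 != row_2:
                yield number, row_1, row_2
```

Like the example above, this compares rows by position only, so it assumes both files list their rows in the same order.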

answered Jun 20, 2012 at 13:00

octopusgrabbus


ChrisP · Accepted Answer · 2012-06-19 20:26:04Z

0

Here is a start that does not use difflib. It is really just a point to build from, because Adam and apples might appear twice on the sheet; can you ensure that is not the case? Should the apples be summed, or is that an error?

import csv
sheet1 = {}
with open('sheet.csv', newline='') as fsock:
    rdr = csv.reader(fsock)
    next(rdr)  # skip the header row
    for row in rdr:
        name, produce, amount = row
        sheet1[(name, produce)] = int(amount) # always an integer?
# repeat the above for the second sheet, then compare

You get the idea?
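As for the duplicate question, here is a rough sketch that supports either answer: sum repeated (name, produce) pairs, or treat them as an error. The function name load_sheet and the sum_duplicates flag are invented for illustration, and a comma-separated file with a header row is assumed.

```python
import csv
from collections import defaultdict

def load_sheet(path, sum_duplicates=True):
    """Map (name, produce) -> amount for a CSV with a Name, Produce, Number header."""
    totals = defaultdict(int)
    seen = set()
    with open(path, newline="") as f:
        rdr = csv.reader(f)
        next(rdr)  # skip the header row
        for name, produce, amount in rdr:
            key = (name, produce)
            if key in seen and not sum_duplicates:
                raise ValueError("duplicate entry for %s / %s" % key)
            seen.add(key)
            totals[key] += int(amount)  # repeated keys are added together
    return dict(totals)
```

With both sheets loaded this way, comparing them comes down to comparing two dictionaries key by key.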

answered Jun 19, 2012 at 20:26

ChrisP
