8

I have two csv file I need to compare and then spit out the differnces:

CSV FORMAT:

 Name   Produce   Number
 Adam   Apple     5
 Tom    Orange    4
 Adam   Orange    11

I need to compare the two csv files and then tell me if there is a difference between Adams apples on sheet and sheet 2 and do that for all names and produce numbers. Both CSV files will be formated the same.

Any pointers will be greatly appreciated

3
  • What version of Python are you using? Commented Jun 19, 2012 at 20:19
  • You've tagged this with excel but mention CSV files. Do you need to work with xlsx or xls files? You might find that diff works for want you need, but you haven't really said whether this needs to be done a lot and build into an existing python program. Commented Jun 19, 2012 at 20:19
  • Sorry you are correct will edit Commented Jun 19, 2012 at 20:20

6 Answers 6

8

I have used csvdiff

$pip install csvdiff
$csvdiff --style=compact col1 a.csv b.csv 

Link to package on pypi

I found this link useful

Sign up to request clarification or add additional context in comments.

Comments

5

If your CSV files aren't so large they'll bring your machine to its knees if you load them into memory, then you could try something like:

import csv
csv1 = list(csv.DictReader(open('file1.csv')))
csv2 = list(csv.DictReader(open('file2.csv')))
set1 = set(csv1)
set2 = set(csv2)
print set1 - set2 # in 1, not in 2
print set2 - set1 # in 2, not in 1
print set1 & set2 # in both

For large files, you could load them into a SQLite3 database and use SQL queries to do the same, or sort by relevant keys and then do a match-merge.

3 Comments

The dicts in the csv1 list are not hashable so creating set1 will not be possible. This can be avoided by conversion of the dicts to strings with json.dumps
@HK_CK okay I am happy for you to add that to answer... just not change it as you suggedted...
TypeError: unhashable type: 'dict'. Come on!
1

One of the best utilities for comparing two different files is diff.

See Python implementation here: Comparing two .txt files using difflib in Python

Comments

1
import csv

def load_csv_to_dict(fname, get_key, get_data):
    with open(fname, 'rb') as inf:
        incsv = csv.reader(inf)
        incsv.next()  # skip header
        return {get_key(row):get_data(row) for row in incsv}

def main():
    key = lambda r: tuple(r[0:2])
    data = lambda r: int(r[2])
    f1 = load_csv_to_dict('file1.csv', key, data)
    f2 = load_csv_to_dict('file2.csv', key, data)

    f1keys = set(f1.iterkeys())
    f2keys = set(f2.iterkeys())

    print("Keys in file1 but not file2:")
    print(", ".join(str(a)+":"+str(b) for a,b in (f1keys-f2keys)))

    print("Keys in file2 but not file1:")
    print(", ".join(str(a)+":"+str(b) for a,b in (f2keys-f1keys)))

    print("Differing values:")
    for k in (f1keys & f2keys):
        a,b = f1[k], f2[k]
        if a != b:
            print("{}:{} {} <> {}".format(k[0],k[1], a, b))

if __name__=="__main__":
    main()

1 Comment

When I try this in IDLE I get this: pastebin.com/6U035ERr (Using pastebin so you can see the whole error message with formatting)
1

If you want to use Python's csv module along with a function generator, you can use nested looping and compare large .csv files. The example below compares each row using a cursory comparision:

import csv

def csv_lazy_get(csvfile):
    with open(csvfile) as f:
        r = csv.reader(f)
        for row in r:
            yield row

def csv_cmp_lazy(csvfile1, csvfile2):
    gen_2 = csv_lazy_get(csvfile2)

    for row_1 in csv_lazy_get(csvfile1):
        row_2 = gen_2.next()

        print("row_1: ", row_1)
        print("row_2: ", row_2)

        if row_2 == row_1:
            print("row_1 is equal to row_2.")
        else:
            print("row_1 is not equal to row_2.")

    gen_2.close()

Comments

0

Here a start that does not use difflib. It is really just a point to build from because maybe Adam and apples appear twice on the sheet; can you ensure that is not the case? Should the apples be summed, or is that an error?

import csv
fsock = open('sheet.csv','rU')
rdr = csv.reader(fsock)
sheet1 = {}
for row in rdr:
    name, produce, amount = row
    sheet1[(name, produce)] = int(amount) # always an integer?
fsock.close()
# repeat the above for the second sheet, then compare

You get the idea?

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.