Return to Answer

added 37 characters in body

Source Link

edited Oct 6, 2016 at 0:45

Simon

1.1k
6
15

That being said, they problem statement seems foggy. The MD5 function does not ever yield the same hash for two different sets of data, when concerned with these kinda problems. That is way it is called a hash function or a one way function. If the hashes is identical the content is identical.

Source Link

answered Oct 6, 2016 at 0:30

Simon

1.1k
6
15

Well this was a very satisfying problem, thanks for sharing!

First of all, calling external resources is expensive, there for not optimized, which you ask for. Else wise, calling external resources can be preferable if the external resource is something like shell on a platform you have control over. That's the reasons I removed them and substituted them with python built-ins. It's pretty much the only reason this code is slightly faster then yours.

I found one small error in your code. What if a file you try to hash has spaces? The problem occurs when you split the return from md5_checksum, it splits to as many values as there are white spaces.

The most time consuming function of both our code is walk. It's easy to check these where cpu-time went with profilers. And python has a builtin I like, but there are many. It's the cProfiler, check my code for usage.

The biggest change was refactoring the function are_identical for

if any(cmp(x, y) for x in paths for y in paths if y != x):
                print('\nThey are identical\n')

They do the same thing, but the any() builtin, is.. as well faster, then iterating over lists.

I did remove your function comments, as they can be substituted for good function names and annotations. Do you agree?

from os import walk
from os.path import join
from hashlib import md5
from filecmp import cmp
from base64 import b64encode
from time import time
import cProfile


def md5_checksum(file_path: str) -> (bytes, str):
    """ Returns the raw MD5 bytes here used as checksum a given files content """

    with open(file_path, "rb") as f:
        file = f.read()
    m = md5()
    m.update(file)
    return m.digest(), file_path


def md5_checksum_table(dir_name: str, suffix: str) -> {bytes: [str]}:
    """
    Searches a directory for files with a given file format (a suffix) and
      computes their MD5 checksums.
    """
    table = {}
    for root, sub, files in walk(dir_name):
        for file in files:
            if file.endswith(suffix):
                checksum, filename = md5_checksum(join(root, file))
                table.setdefault(checksum, []).append(filename)
    return table


def print_duplicates(checksums: {bytes: [str]}):
    """ Prints paths of files have the same MD5 checksum and are identical. """

    for checksum, paths in checksums.items():
        if len(paths) > 1:
            print('Files have the checksum {0} are:\n {1}'.format(b64encode(checksum),
                                                                "\n".join(paths)))
            if any(cmp(x, y) for x in paths for y in paths if y != x):
                print('\nThey are identical\n')


def main():
    start = time()
    table = md5_checksum_table('/media/sf_Shared/', '.pdf')
    print_duplicates(table)
    print("Time {:.3f}s".format(time()-start))
    cProfile.run("md5_checksum_table('/home/cly/', '.pdf')")
    cProfile.run("print_duplicates({})".format(table))



if __name__ == '__main__':
    main()

That being said, they problem statement seems foggy. The MD5 function does not ever yield the same hash for two different sets of data. That is way it is called a hash function or a one way function. If the hashes is identical the content is identical.

The last thing I will say, is that even the very fast hash function MD5, is slower then a efficient comparing of the content. So I criticize the problem not your solution.

Thanks! Good work.