
I'm trying to compute hashes for a whole directory, in order to monitor changes later. That part is relatively easy. However, with big files the computation takes too long, so I ended up using some multithreading.

Because of I/O bottlenecks, I should read a file with a single thread, but I can calculate the hash for that file in multiple threads by calling TransformBlock from all of them at once. The problem is that the result of each run is different: because all the threads update one instance of the hash algorithm, they do so erratically.

  public delegate void CalculateHashDelegate(byte[] buffer);
  private MD5 md5;
  private long completed_threads_hash;
  private object completed_threads_lock = new object();
  private object lock_for_hash = new object();

  private string getMd5Hash(string file_path)
  {
        string file_to_be_hashed = file_path;
        byte[] hash;

        try
        {
            CalculateHashDelegate CalculateHash = AsyncCalculateHash;
            md5 = MD5.Create();

            using (Stream input = File.OpenRead(file_to_be_hashed))
            {
                int buffer_size = 0x4096;
                byte[] buffer = new byte[buffer_size];

                long part_count = 0;
                completed_threads_hash = 0;
                int bytes_read;
                while ((bytes_read = input.Read(buffer, 0, buffer.Length)) == buffer_size)
                {
                    part_count++;
                    IAsyncResult ar_hash = CalculateHash.BeginInvoke(buffer, CalculateHashCallback, CalculateHash);
                }

                // Wait for completing all the threads
                while (true)
                {
                    lock (completed_threads_lock)
                    {
                        if (completed_threads_hash == part_count)
                        {  
                            md5.TransformFinalBlock(buffer, 0, bytes_read);
                            break;
                        }
                    }
                }

                hash = md5.Hash;

            }

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hash.Length; i++)
            {
                sb.Append(hash[i].ToString("x2"));
            }
            md5.Clear();
            return sb.ToString();
        }
        catch (Exception ex)
        {
            Console.WriteLine("An exception was encountered during hashing file {0}. {1}.", file_to_be_hashed, ex.Message);
            return ex.Message;
        }
    }

    public void AsyncCalculateHash(byte[] buffer)
    {
        lock (lock_for_hash)
        {
            md5.TransformBlock(buffer, 0, buffer.Length, null, 0);
        }
    }

    private void CalculateHashCallback(IAsyncResult ar_hash)
    {
        try
        {
            CalculateHashDelegate CalculateHash = ar_hash.AsyncState as CalculateHashDelegate;
            CalculateHash.EndInvoke(ar_hash);
        }
        catch (Exception ex)
        {
            Console.WriteLine("Callback exception: {0}", ex.Message);
        }
        finally
        {
            lock (completed_threads_lock)
            {
                completed_threads_hash++;
            }
        }
    }

Is there a way to organize the hashing process correctly? I can't use .NET newer than 3.5, nor classes such as BackgroundWorker and ThreadPool. Or maybe there is another method for calculating a hash in parallel?

1 Answer


Generally you cannot use a single cryptographic object from multiple threads. The problem with hash functions is that they are fully sequential: each block of hashing depends on the current state, and the state is built from all the previous blocks. So basically, you cannot do this for MD5.

There is another construction that can be used, called a hash tree or Merkle tree. Basically you decide on a block size and calculate the hashes of the blocks. These hashes are then concatenated and hashed again. If you have a very large number of blocks you may actually create a tree, as described in the Wikipedia article. Of course the resulting hash is different from plain MD5 and depends on the configuration parameters of the hash tree.
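A minimal single-threaded sketch of the two-level scheme described above, to show the structure before any threading is added. The block size (1 MiB) and the use of MD5 are arbitrary assumptions for illustration; any HashAlgorithm subclass would work the same way:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class BlockHashExample
{
    // Hash each fixed-size block of the file, then hash the
    // concatenation of the block hashes ("hash of hashes").
    public static byte[] TwoLevelHash(string path, int blockSize)
    {
        using (MD5 blockHasher = MD5.Create())
        using (MD5 topHasher = MD5.Create())
        using (Stream input = File.OpenRead(path))
        {
            byte[] buffer = new byte[blockSize];
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Hash this block on its own.
                byte[] blockHash = blockHasher.ComputeHash(buffer, 0, bytesRead);
                // Feed the block hashes, in file order, into the top-level hash.
                topHasher.TransformBlock(blockHash, 0, blockHash.Length, null, 0);
            }
            topHasher.TransformFinalBlock(new byte[0], 0, 0);
            return topHasher.Hash;
        }
    }
}
```

Because each block hash is independent of the others, the inner `ComputeHash` calls are the part that can later be farmed out to threads, as long as the block hashes are fed into the top-level hash in file order.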

Note that MD5 has been broken. You should be using SHA-256 or SHA-512/xxx (faster on 64-bit processors) instead. Also note that I/O speed is often more of a bottleneck than the hash algorithm itself, negating any speed advantage of hash trees. If you have many files, you could also parallelize the hashing at the file level.
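Since the asker is restricted to .NET 3.5 and cannot use ThreadPool or BackgroundWorker, the file-level parallelism suggested above can still be done with plain Thread objects. A hedged sketch (one thread per file is a simplifying assumption; a real implementation would cap the thread count):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading;

static class PerFileHashing
{
    // One worker thread per file. Each thread owns its own MD5 instance,
    // so no locking is needed around the hashing itself.
    public static Dictionary<string, byte[]> HashFiles(IEnumerable<string> paths)
    {
        var results = new Dictionary<string, byte[]>();
        var threads = new List<Thread>();

        foreach (string path in paths)
        {
            string p = path; // copy: avoid the captured-loop-variable pitfall in C# 3
            var t = new Thread(delegate()
            {
                using (MD5 md5 = MD5.Create())
                using (Stream s = File.OpenRead(p))
                {
                    byte[] hash = md5.ComputeHash(s);
                    lock (results) { results[p] = hash; } // only the dictionary is shared
                }
            });
            t.Start();
            threads.Add(t);
        }

        foreach (Thread t in threads) t.Join();
        return results;
    }
}
```

Note that this trades the single-file I/O pattern for concurrent reads of several files, which may or may not help depending on the disk, as the comments below point out.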


3 Comments

I could use any of the HashAlgorithm subclasses; MD5 was chosen for simplicity. And I cannot parallelize reading from the HDD, because with big files it causes unpredictable delays. If there is any way to parallelize, it has to be in the hash-computing part.
I have tried to implement hashing of concatenated hashes. It seemed very promising, but hashing all the block hashes, concatenated into one byte[] array, gives different results. When I did it linearly, the result was fine and constant; but when I did it in parallel threads, the result was different almost every time. I think it's because of the same problem as with TransformBlock: the source for the final hashing is mutable. And I don't see how to build an actual Merkle tree in parallel threads; the intermediate hashes will differ too. Am I wrong?
As long as you keep the hashes in file order, you should always get the same Merkle tree hash. Obviously you need to keep the block size etc. constant; the tree itself should look identical each time.
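One way to get the ordering guarantee from the last comment while still hashing blocks on several threads is to write each block hash into a slot indexed by its block number, then combine the slots in order after all workers finish. A sketch under those assumptions (blocks are presumed already read sequentially into memory; one thread per block is a simplification):

```csharp
using System;
using System.Security.Cryptography;
using System.Threading;

static class OrderedBlockHashes
{
    // blocks[i] holds the i-th block of the file, read sequentially.
    // Each thread writes its result into its own slot, so the final
    // concatenation order is fixed regardless of thread scheduling.
    public static byte[] HashBlocksInOrder(byte[][] blocks)
    {
        byte[][] blockHashes = new byte[blocks.Length][];
        var threads = new Thread[blocks.Length];

        for (int i = 0; i < blocks.Length; i++)
        {
            int index = i; // capture a fresh copy per iteration
            threads[i] = new Thread(delegate()
            {
                using (MD5 md5 = MD5.Create()) // one instance per thread: no sharing
                    blockHashes[index] = md5.ComputeHash(blocks[index]);
            });
            threads[i].Start();
        }
        foreach (Thread t in threads) t.Join();

        // Combine the per-block hashes in block order, single-threaded.
        using (MD5 top = MD5.Create())
        {
            for (int i = 0; i < blockHashes.Length; i++)
                top.TransformBlock(blockHashes[i], 0, blockHashes[i].Length, null, 0);
            top.TransformFinalBlock(new byte[0], 0, 0);
            return top.Hash;
        }
    }
}
```

The nondeterminism the commenter observed comes from feeding hashes into the final digest in completion order; indexing by block number removes that source of variation.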
