1

We have this old code in our repo:

    bool BufferByteCompareFiles(string filePath1, string filePath2)
    {
        int bufferCapacity = 0x400000;
        var firstFile = File.OpenRead(filePath1);
        var secondFile = File.OpenRead(filePath2);

        using var firstStream = new BufferedStream(firstFile);
        using var secondStream = new BufferedStream(secondFile);

        if (firstFile.Length != secondFile.Length)
            return false;

        var firstBuffer = new byte[bufferCapacity];
        var secondBuffer = new byte[bufferCapacity];

        int bytesReadFirst;
        int bytesReadSecond;
        do
        {
            bytesReadFirst = firstStream.Read(firstBuffer, 0, firstBuffer.Length);
            bytesReadSecond = secondStream.Read(secondBuffer, 0, secondBuffer.Length);

            if (bytesReadFirst != bytesReadSecond || !CompareByteArrays(firstBuffer, secondBuffer))
                return false;

        } while (bytesReadFirst > 0 && bytesReadSecond > 0);

        return true;
    }

    static bool CompareByteArrays(byte[] first, byte[] second)
        => first.Length == second.Length && Interop.memcmp(first, second, first.Length) == 0;

and the part that I don't understand is why check the byte array lengths in CompareByteArrays when already there is an if statement checking if bytesReadFirst != bytesReadSecond and those ints are the amount of bytes read by BufferedStream.Read().

Am I missing something or can the whole first.Length == second.Length be omitted?

4
  • 1
    This code seems wrong in the case where the number of bytes read is less than the size of the buffers as it will compare beyond the data that has actually been read. Commented Feb 25, 2023 at 15:19
  • Also, if the goal is to check if the two streams are identical, it's quicker to test whether their lengths are identical, first. Commented Feb 25, 2023 at 15:22
  • @KlausGütter, sorry, posted entire method, do you still think the issue stands? Commented Feb 25, 2023 at 15:42
  • @Yarin_007, posted entire method, there is such a check, yes. Commented Feb 25, 2023 at 15:42

1 Answer 1

1

Frankly, that code is simply wrong. These are two different tests:

  1. the first tests checks whether the amount read in each Read call is the same; Read doesn't define what it will return - Read is given a buffer with an offset and count, and can return
  • 0, for an EOF scenario
  • any number between 1 and count (inclusive) if some data is available (it is not required to fill count bytes; whatever is convenient is fine)
  • or throw an exception
  1. the second test checks whether the arrays are the same size; this is a very different question

However, because of 1, the test itself is doomed; we have a huge buffer here (too large, IMO); it would be entirely valid for firstStream.Read to return 3 and secondStream.Read to return 42, and that would tell you absolutely nothing about whether the payloads are the same - you have to keep reading both streams and comparing however many intersecting bytes are available in both. The easiest way to do that would be to fill both buffers each time, i.e.

static ReadOnlySpan<byte> ReadAsMuchAsPossible(Stream stream, byte[] buffer)
{
    int read, totalRead = 0;
    while (totalRead < buffer.Length && (read = stream.Read(buffer, totalRead, buffer.Length - totalRead)) > 0)
    {
        totalRead += read;
    }
    return new ReadOnlySpan<byte>(buffer, 0, totalRead);
}

// ... loop over the streams, with, per iteration
Debug.Assert(firstBuffer.Length == secondBuffer.Length);
var firstChunk = ReadAsMuchAsPossible(firstStream, firstBuffer);
var secondChunk = ReadAsMuchAsPossible(secondStream, secondBuffer);
bool chunkSame = firstChunk.SequenceEqual(secondChunk);

letting the runtime worry about optimizing the span comparison

Sign up to request clarification or add additional context in comments.

1 Comment

wait... i just now realised first.Length == second.Length compares array sizes. So, won't that ALWAYS be true, since both arrays are created with the same size of bufferCapacity? How would Read returning different numbers not tell me anything? I'm testing whether the files are identical, byte per byte. So if Read returns two different numbers, one byte stream is shorter and thus, not equal. Also, calling Read already advances the seeker, or whatever it was called, calling it a second time with an offset of 0 will begin reading from where the first call left off.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.