
I'm reading data from files (such as CSV and Excel) and need to ensure that each row in the file is unique.

Each row will be represented as an object[]. This cannot be changed due to the current architecture. Each object in this array can be of a different type (decimal, string, int, etc.).

A file can look like this:

foo    1      5 // Not unique
bar    1      5
bar    2      5
foo    1      5 // Not unique

A file is likely to have 200,000+ rows and 4-100 columns.

The code I have right now looks like this:

IList<object[]> rows = new List<object[]>();

using (var reader = _deliveryObjectReaderFactory.CreateReader(deliveryObject))
{
    // Read the row.
    while (reader.Read())
    {
        // Get the values from the file.
        var values = reader.GetValues();

        // Check uniqueness for row
        foreach (var row in rows)
        {
            bool rowsAreDifferent = false;

            // Check uniqueness for column.
            for (int i = 0; i < row.Length; i++)
            {
                var earlierValue = row[i];
                var newValue = values[i];
                if (earlierValue.ToString() != newValue.ToString())
                {
                    rowsAreDifferent = true;
                    break;
                }
            }
            if (!rowsAreDifferent)
                throw new Exception("Rows are not unique");
        }
        rows.Add(values);
    }
}

So, my question: can this be done more efficiently, for example by computing a hash per row and checking uniqueness of the hashes first?

  • You do realize that it's possible for two objects to have the same hash and still be unequal, don't you? In other words, if your hash is done right, a file could have duplicate hashes but still have unique rows. Commented May 17, 2016 at 6:21
  • What about using a HashSet<T> with a custom equality comparer? Commented May 17, 2016 at 6:22
  • @phoog, yes I'm well aware of that. The solution would first check hash, and if the hashes are equal it would have to check the other values as well. But maybe it's more efficient to check hash first, instead of always checking all the values. Commented May 17, 2016 at 6:24
  • @Jehof - What do you suggest would T to be? Commented May 17, 2016 at 6:25
  • object[] as you suggested due to current architecture Commented May 17, 2016 at 6:26
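The hash-first idea from the comments (check a cheap per-row hash first, and only compare the actual values when two hashes collide) can be sketched like this. This is a rough illustration, not production code; TryAddRow, the seed/multiplier constants, and the sample rows are made up for the demo:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var buckets = new Dictionary<int, List<object[]>>();

// Returns true if the row was new, false if an equal row was seen before.
bool TryAddRow(object[] row)
{
    // Combine the element hashes into one row hash (null element -> 0).
    int h = row.Aggregate(17, (acc, o) => unchecked(acc * 31 + (o?.GetHashCode() ?? 0)));

    if (!buckets.TryGetValue(h, out var bucket))
        buckets[h] = bucket = new List<object[]>();
    else if (bucket.Any(r => r.Length == row.Length
                          && r.Zip(row, (a, b) => object.Equals(a, b)).All(eq => eq)))
        return false; // hashes collided AND every value matched -> duplicate

    bucket.Add(row);
    return true;
}

Console.WriteLine(TryAddRow(new object[] { "foo", 1, 5m })); // True
Console.WriteLine(TryAddRow(new object[] { "bar", 1, 5m })); // True
Console.WriteLine(TryAddRow(new object[] { "foo", 1, 5m })); // False -> duplicate
```

This is essentially what a HashSet<T> with a custom comparer does internally, which is why the answer below is the simpler route.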

1 Answer


You could use a HashSet<object[]> with a custom IEqualityComparer<object[]> like this:

HashSet<object[]> rows = new HashSet<object[]>(new MyComparer());

while (reader.Read())
{
    // Get the values from the file.
    var values = reader.GetValues();    
    if (!rows.Add(values))
        throw new Exception("Rows are not unique");
}

And MyComparer could be implemented like this:

public class MyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null) || x.Length != y.Length) return false;
        return x.Zip(y, (a, b) => object.Equals(a, b)).All(c => c);
    }
    public int GetHashCode(object[] obj)
    {
        unchecked
        {
            // Returns 0 if obj is null; otherwise combines the hashes of all
            // elements via hash = (hash * 397) ^ nextHash, treating a null
            // element's hash as 0. (This is the ReSharper suggestion for
            // GetHashCode implementations.)
            return obj?.Aggregate(0, (hash, o) => (hash * 397) ^ (o?.GetHashCode() ?? 0)) ?? 0;
        }
    }
}

Note that the element comparison uses object.Equals(a, b) rather than a == b: with operands statically typed as object, == compares references, so two equal boxed value types (or equal strings read from different rows) would wrongly count as different.
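For a quick sanity check, the comparer and the HashSet approach can be exercised with the sample rows from the question. The comparer is repeated here, using object.Equals for the element comparison, so the snippet compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var rows = new HashSet<object[]>(new MyComparer());

// Add returns false when an equal row is already present.
Console.WriteLine(rows.Add(new object[] { "foo", 1, 5 })); // True
Console.WriteLine(rows.Add(new object[] { "bar", 1, 5 })); // True
Console.WriteLine(rows.Add(new object[] { "foo", 1, 5 })); // False -> duplicate

public class MyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null) || x.Length != y.Length) return false;
        // object.Equals compares boxed value types by value, not by reference.
        return x.Zip(y, (a, b) => object.Equals(a, b)).All(c => c);
    }
    public int GetHashCode(object[] obj)
    {
        unchecked
        {
            // Combine element hashes; null array -> 0, null element -> 0.
            return obj?.Aggregate(0, (hash, o) => (hash * 397) ^ (o?.GetHashCode() ?? 0)) ?? 0;
        }
    }
}
```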


8 Comments

oh, just saw that @Jehof already suggested this while I was writing, so you probably already knew how to do that...
Yes, I'm trying it out now. But without the fancy C#6-features. ;)
That last return statement looks downright scary to me. I'd probably need a good amount of coffee and 15 minutes of time to figure out why it does what it does. Do you mind adding a line or two, commenting on the ? operator and why you multiply by 397?
You can make your code a little more concise by exploiting the fact that rows.Add(values) returns a boolean to indicate whether the value was in fact added -- if it was already present and therefore not added, the method returns false and you can throw. Also, you can't use == because that will do a reference comparison on reference types.
@Marco it combines the hashcodes of all objects in the array by doing hash = (lasthash * 397) ^ nextHash, which is the way ReSharper creates GetHashCode implementations for you. So I guess it seems to be a good way to create hashes. If the array is null it returns 0 and for each null element in the array it assumes 0 as hash.
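Unrolled into a plain loop, that Aggregate call is equivalent to the following (a self-contained illustration; CombineHashes is a made-up name for the demo):

```csharp
using System;

// Equivalent of: obj?.Aggregate(0, (hash, o) => (hash * 397) ^ (o?.GetHashCode() ?? 0)) ?? 0
static int CombineHashes(object[] obj)
{
    if (obj == null) return 0;       // ?? 0 on a null array
    int hash = 0;                    // Aggregate seed
    unchecked                        // let the multiplication overflow silently
    {
        foreach (var o in obj)
            hash = (hash * 397) ^ (o?.GetHashCode() ?? 0); // null element -> 0
    }
    return hash;
}

Console.WriteLine(CombineHashes(null));                  // 0
Console.WriteLine(CombineHashes(new object[] { null })); // 0
```

397 is just an odd prime that spreads the combined bits reasonably well; it is the constant ReSharper happens to emit, not a magic requirement.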
