
I'm reading data from files (such as CSV and Excel) and need to ensure that each row in the file is unique.

Each row will be represented as an object[]. This cannot be changed due to the current architecture. Each object in this array can be of a different type (decimal, string, int, etc.).

A file can look like this:

foo    1      5 // Not unique
bar    1      5
bar    2      5
foo    1      5 // Not unique

A file is likely to have 200,000+ rows and 4-100 columns.

The code I have right now looks like this:

IList<object[]> rows = new List<object[]>();

using (var reader = _deliveryObjectReaderFactory.CreateReader(deliveryObject))
{
    // Read the row.
    while (reader.Read())
    {
        // Get the values from the file.
        var values = reader.GetValues();

        // Check uniqueness for row
        foreach (var row in rows)
        {
            bool rowsAreDifferent = false;

            // Check uniqueness for column.
            for (int i = 0; i < row.Length; i++)
            {
                var earlierValue = row[i];
                var newValue = values[i];
                if (earlierValue.ToString() != newValue.ToString())
                {
                    rowsAreDifferent = true;
                    break;
                }
            }
            if (!rowsAreDifferent)
                throw new Exception("Rows are not unique");
        }
        rows.Add(values);
    }
}

So, my question: can this be done more efficiently, for example by computing a hash per row and checking uniqueness of the hashes first?

  • You do realize that it's possible for two objects to have the same hash and still be unequal, don't you? In other words, if your hash is done right, a file could have duplicate hashes but still have unique rows. Commented May 17, 2016 at 6:21
  • What about using a HashSet<T> with a custom equality comparer? Commented May 17, 2016 at 6:22
  • @phoog, yes I'm well aware of that. The solution would first check hash, and if the hashes are equal it would have to check the other values as well. But maybe it's more efficient to check hash first, instead of always checking all the values. Commented May 17, 2016 at 6:24
  • @Jehof - What do you suggest would T to be? Commented May 17, 2016 at 6:25
  • object[] as you suggested due to current architecture Commented May 17, 2016 at 6:26
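The hash-first idea from the comments (check a cheap per-row hash first, and only compare the actual values when two hashes collide) can be sketched like this. This is a rough illustration, not production code; TryAddRow, the seed/multiplier constants, and the sample rows are made up for the demo:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var buckets = new Dictionary<int, List<object[]>>();

// Returns true if the row was new, false if an equal row was seen before.
bool TryAddRow(object[] row)
{
    // Combine the element hashes into one row hash (null element -> 0).
    int h = row.Aggregate(17, (acc, o) => unchecked(acc * 31 + (o?.GetHashCode() ?? 0)));

    if (!buckets.TryGetValue(h, out var bucket))
        buckets[h] = bucket = new List<object[]>();
    else if (bucket.Any(r => r.Length == row.Length
                          && r.Zip(row, (a, b) => object.Equals(a, b)).All(eq => eq)))
        return false; // hashes collided AND every value matched -> duplicate

    bucket.Add(row);
    return true;
}

Console.WriteLine(TryAddRow(new object[] { "foo", 1, 5m })); // True
Console.WriteLine(TryAddRow(new object[] { "bar", 1, 5m })); // True
Console.WriteLine(TryAddRow(new object[] { "foo", 1, 5m })); // False -> duplicate
```

This is essentially what a HashSet<T> with a custom comparer does internally, which is why the answer below is the simpler route.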

1 Answer


You could use a HashSet<object[]> with a custom IEqualityComparer<object[]> like this:

HashSet<object[]> rows = new HashSet<object[]>(new MyComparer());

while (reader.Read())
{
    // Get the values from the file.
    var values = reader.GetValues();    
    if (!rows.Add(values))
        throw new Exception("Rows are not unique");
}

And MyComparer could be implemented like this:

public class MyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null) || x.Length != y.Length) return false;
        return x.Zip(y, (a, b) => object.Equals(a, b)).All(c => c);
    }
    public int GetHashCode(object[] obj)
    {
        unchecked
        {
            // Returns 0 if obj is null; otherwise combines the hashes of all
            // elements via hash = (hash * 397) ^ nextHash, treating a null
            // element's hash as 0. (This is the ReSharper suggestion for
            // GetHashCode implementations.)
            return obj?.Aggregate(0, (hash, o) => (hash * 397) ^ (o?.GetHashCode() ?? 0)) ?? 0;
        }
    }
}

Note that the element comparison uses object.Equals(a, b) rather than a == b: with operands statically typed as object, == compares references, so two equal boxed value types (or equal strings read from different rows) would wrongly count as different.
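For a quick sanity check, the comparer and the HashSet approach can be exercised with the sample rows from the question. The comparer is repeated here, using object.Equals for the element comparison, so the snippet compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var rows = new HashSet<object[]>(new MyComparer());

// Add returns false when an equal row is already present.
Console.WriteLine(rows.Add(new object[] { "foo", 1, 5 })); // True
Console.WriteLine(rows.Add(new object[] { "bar", 1, 5 })); // True
Console.WriteLine(rows.Add(new object[] { "foo", 1, 5 })); // False -> duplicate

public class MyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (ReferenceEquals(x, null) || ReferenceEquals(y, null) || x.Length != y.Length) return false;
        // object.Equals compares boxed value types by value, not by reference.
        return x.Zip(y, (a, b) => object.Equals(a, b)).All(c => c);
    }
    public int GetHashCode(object[] obj)
    {
        unchecked
        {
            // Combine element hashes; null array -> 0, null element -> 0.
            return obj?.Aggregate(0, (hash, o) => (hash * 397) ^ (o?.GetHashCode() ?? 0)) ?? 0;
        }
    }
}
```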


8 Comments

oh, just saw that @Jehof already suggested this while I was writing, so you probably already knew how to do that...
Yes, I'm trying it out now. But without the fancy C#6-features. ;)
That last return statement looks downright scary to me. I'd probably need a good amount of coffee and 15 minutes of time to figure out why it does what it does. Do you mind adding a line or two, commenting on the ? operator and why you multiply by 397?
You can make your code a little more concise by exploiting the fact that rows.Add(values) returns a boolean to indicate whether the value was in fact added -- if it was already present and therefore not added, the method returns false and you can throw. Also, you can't use == because that will do a reference comparison on reference types.
@Marco it combines the hashcodes of all objects in the array by doing hash = (lasthash * 397) ^ nextHash, which is the way ReSharper creates GetHashCode implementations for you. So I guess it seems to be a good way to create hashes. If the array is null it returns 0 and for each null element in the array it assumes 0 as hash.
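Unrolled into a plain loop, that Aggregate call is equivalent to the following (a self-contained illustration; CombineHashes is a made-up name for the demo):

```csharp
using System;

// Equivalent of: obj?.Aggregate(0, (hash, o) => (hash * 397) ^ (o?.GetHashCode() ?? 0)) ?? 0
static int CombineHashes(object[] obj)
{
    if (obj == null) return 0;       // ?? 0 on a null array
    int hash = 0;                    // Aggregate seed
    unchecked                        // let the multiplication overflow silently
    {
        foreach (var o in obj)
            hash = (hash * 397) ^ (o?.GetHashCode() ?? 0); // null element -> 0
    }
    return hash;
}

Console.WriteLine(CombineHashes(null));                  // 0
Console.WriteLine(CombineHashes(new object[] { null })); // 0
```

397 is just an odd prime that spreads the combined bits reasonably well; it is the constant ReSharper happens to emit, not a magic requirement.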
