1

I'm trying to import data from a CSV file. Unfortunately, there is no primary key that would let me uniquely identify a given row, so I created a dictionary whose key is the value returned by GetHashCode. I use a dictionary because a lookup is much faster than searching with LINQ and a Where clause with conditions on several properties.

My GetHashCode override looks like this:

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + this.Id.GetHashCode();
            hash = hash * 23 + this.Author?.GetHashCode() ?? 0.GetHashCode();
            hash = hash * 23 + this.Activity?.GetHashCode() ?? 0.GetHashCode();
            hash = hash * 23 + this.DateTime?.GetHashCode() ?? 0.GetHashCode();
            return hash;
        }
    }

After fetching data from DB I do:

    .ToDictionary(d => d.GetHashCode());

Here is the problem: I checked the database and there are no duplicates with respect to these four properties. Yet when I run the import I often get an error saying that the given key already exists in the dictionary. If I run the import again on the same data, everything runs fine.

How can I fix this error? The import application is written in .NET 5.

Id - long

Author, Activity - string

DateTime - DateTime?

Unfortunately, Id is more like a foreign key and is not unique: there may be many rows with the same Id, Author and Activity but, for example, a different DateTime.

  • By the way, 0.GetHashCode() is always just 0. Commented Nov 29, 2022 at 14:40
  • Hashes result in duplicates by definition. .ToDictionary(d => d.GetHashCode()) is guaranteed to result in duplicate errors. Why are you using a hash as a key at all? Commented Nov 29, 2022 at 14:40
  • 0 This is a possible null protection. I want to check whether the imported row already exists in the database. Unfortunately, each of these values can be duplicated in the file, so I can only import a row that differs in at least one of these 4 items. Commented Nov 29, 2022 at 14:45
  • "if I run the import again for the same data the next time everything runs fine": the implication here is that one of the types on which you are calling GetHashCode() does not have a proper implementation for it. What are the concrete types of Id, Author and Activity? (I'm assuming that DateTime really is a DateTime.) Commented Nov 29, 2022 at 14:47
  • GetHashCode doesn't need to provide different results for objects that are not considered equal by the implementation of Equals. It only should do so in order to provide good performance on sorting and dictionary access. Commented Nov 29, 2022 at 14:50

4 Answers

3

GetHashCode() does NOT produce unique values, so using it as a key in a dictionary can give you the errors that you have observed.
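A collision is easy to reproduce without any custom types. The sketch below is illustrative only: the class and member names are hypothetical, and it relies on the way current .NET implementations compute Int64.GetHashCode() (XOR of the upper and lower 32 bits), which makes 0 and 4294967297 hash to the same int.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class; not part of the question's code.
public static class HashCollisionDemo
{
    // Two different long values. Assumption: Int64.GetHashCode() XORs the
    // upper and lower 32 bits, so both of these hash to 0.
    public const long First = 0L;
    public const long Second = 4294967297L; // (1L << 32) | 1

    public static bool HashesCollide() =>
        First.GetHashCode() == Second.GetHashCode();

    // Keying by the value itself works: the dictionary falls back to Equals
    // to tell the two colliding keys apart.
    public static int CountWhenKeyedByValue() =>
        new[] { First, Second }.ToDictionary(v => v).Count;

    // Keying by the hash code fails: both values produce the same int key,
    // and ToDictionary throws ArgumentException on a duplicate key.
    public static bool ThrowsWhenKeyedByHash()
    {
        try
        {
            new[] { First, Second }.ToDictionary(v => v.GetHashCode());
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

Separately, .NET Core and .NET 5 randomize string.GetHashCode() per process, so which string values collide changes from run to run. That would be consistent with the import failing on one run and succeeding on the next.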

You should implement GetHashCode() AND IEquatable<T> for your key type. Then you will be able to safely put instances of it into a hashing container, so long as there are no duplicate entries. (Items x and y will only be considered duplicates if the GetHashCode() values are the same AND x.Equals(y) returns true).

So for example, your data key class could look like this:

public sealed class DataKey : IEquatable<DataKey>
{
    public long      Id       { get; }
    public string?   Author   { get; }
    public string?   Activity { get; }
    public DateTime? DateTime { get; }

    public DataKey(long id, string? author, string? activity, DateTime? dateTime)
    {
        Id       = id;
        Author   = author;
        Activity = activity;
        DateTime = dateTime;
    }

    public bool Equals(DataKey? other)
    {
        if (other is null)
            return false;

        if (ReferenceEquals(this, other))
            return true;

        return Id == other.Id && Author == other.Author && Activity == other.Activity && Nullable.Equals(DateTime, other.DateTime);
    }

    public override bool Equals(object? obj)
    {
        return ReferenceEquals(this, obj) || obj is DataKey other && Equals(other);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = Id.GetHashCode();
            hashCode = (hashCode * 397) ^ (Author?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ (Activity?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ (DateTime?.GetHashCode() ?? 0);
            return hashCode;
        }
    }
}

That's a lot of boilerplate code. Fortunately, if you are using C# 9 or later (which .NET 5 supports), you can use the record type to simplify this to just:

 public sealed record DataKey(
     long      Id,
     string?   Author,
     string?   Activity,
     DateTime? DateTime);

The record type implements IEquatable<T> and GetHashCode() correctly for you (for the specific types long, string? and DateTime?).

Note that both the example types above are immutable. It's very important when using hashing containers that the properties of a key that contribute to GetHashCode() and Equals() are immutable. If you put an item in a hashing container and then change any of those properties, nasty things happen.
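To make "nasty things" concrete, here is a minimal sketch using a deliberately mutable key. MutableKey and MutationDemo are hypothetical names introduced only for this illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical key type whose hashed property can be changed after construction.
public sealed class MutableKey : IEquatable<MutableKey>
{
    public long Id { get; set; } // settable on purpose, to show the problem

    public MutableKey(long id) { Id = id; }

    public bool Equals(MutableKey? other) => other is not null && Id == other.Id;
    public override bool Equals(object? obj) => Equals(obj as MutableKey);
    public override int GetHashCode() => Id.GetHashCode();
}

public static class MutationDemo
{
    public static (bool Before, bool AfterSameInstance, bool AfterOriginalValue) Run()
    {
        var key = new MutableKey(1);
        var set = new HashSet<MutableKey> { key };

        bool before = set.Contains(key);

        // Changing a property that feeds GetHashCode() while the key is stored:
        // the set still has the entry filed under the old hash code.
        key.Id = 2;

        // Looking up the same instance now searches under the new hash code.
        bool afterSameInstance = set.Contains(key);

        // Looking up the original value finds the old slot, but Equals fails
        // because the stored instance now has Id == 2.
        bool afterOriginalValue = set.Contains(new MutableKey(1));

        return (before, afterSameInstance, afterOriginalValue);
    }
}
```

In this sketch the entry is still in the set but can no longer be found by either lookup, which is why the key types above expose get-only properties.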

1

A hash by definition contains less information than the original, so different inputs can collide. If you use it as a dictionary key, two different rows can map to the same key, which ToDictionary reports as a duplicate-key error.

From the comments, it appears the real problem is how to use a composite key. You can use any type that has value equality for this. Two options are ValueTuples and records, e.g.:

.ToDictionary(d=>(d.Id,d.Author,d.Activity,d.DateTime));

A possible problem is that ValueTuples are mutable.

You can use record or record struct to create a predefined key type that uses value equality.

public record ActivityKey( long Id, 
                           string? Author, 
                           string? Activity, 
                           DateTime? DateTime);
...
.ToDictionary(d=>new ActivityKey(d.Id,d.Author,d.Activity,d.DateTime));
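The same composite key can also answer the question from the comments, namely whether an imported row already exists. The following is only a sketch under assumptions: Row and ImportFilter are hypothetical names, and Row stands in for both the entities loaded from the database and the parsed CSV lines.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ActivityKey(long Id, string? Author, string? Activity, DateTime? DateTime);

// Hypothetical row type; the real entity would have more columns.
public sealed record Row(long Id, string? Author, string? Activity, DateTime? DateTime)
{
    // Composite key built from the four identifying properties.
    public ActivityKey Key => new ActivityKey(Id, Author, Activity, DateTime);
}

public static class ImportFilter
{
    // Returns the imported rows whose key is neither in the database
    // nor repeated earlier in the imported data.
    public static List<Row> NewRows(IEnumerable<Row> existing, IEnumerable<Row> imported)
    {
        var seen = new HashSet<ActivityKey>(existing.Select(r => r.Key));

        // HashSet<T>.Add returns false when an equal key is already present,
        // so this keeps only the first occurrence of each unseen key.
        return imported.Where(r => seen.Add(r.Key)).ToList();
    }
}
```

Because the set compares keys with Equals as well as GetHashCode(), a hash collision between two different rows only costs a little extra comparison work; it does not cause a row to be skipped.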

1

It seems that you may be using a different sort of hashing scheme than you need to.

If you are hashing to represent your row data as a unique value, you'll probably want something longer than an int.

Your GetHashCode() implementation looks good. However, it is for use in hash tables, not representative ID hashes, which is probably what you want.

Try something like:

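public class Record {
    public long ID;
    public string Author;
    public string Activity;
    public DateTime? DateTime;
    
    public string GetRowHash() {
        // Separate the fields with a character that is unlikely to occur in the data,
        // so that ("ab", "c") and ("a", "bc") do not produce the same input string.
        const char separator = '\u001F';
        var builder = new System.Text.StringBuilder();
        builder.Append(this.ID).Append(separator);
        builder.Append(this.Author ?? "").Append(separator);
        builder.Append(this.Activity ?? "").Append(separator);
        // The round-trip "O" format is culture-independent and keeps full precision.
        builder.Append(this.DateTime?.ToString("O") ?? "");
        
        using (var md5 = System.Security.Cryptography.MD5.Create()) {
            // UTF-8 preserves non-ASCII characters; ASCII would replace them with '?'.
            byte[] buffer = System.Text.Encoding.UTF8.GetBytes(builder.ToString());
            byte[] hash = md5.ComputeHash(buffer);
            return Convert.ToBase64String(hash);
        }
    }
}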

Then use GetRowHash() as your ID. If you have duplicates, it will be because the row information is duplicated, not because you've overrun an int hash.

You may need to change Convert.ToBase64String(...) to something else depending on how you're storing these values in the database.

Incidentally, if you only have 4 data fields, it is way easier (and faster) to compare values in the database (using SQL) than in code. One good query to hunt out duplicates will work much more efficiently. You may even find loading the CSV directly to a table to be a good option, if your data are minimally clean enough.

0

You lose information when hashing.

You go from having multiple properties of various types (strings, datetime, numbers, etc.) and reduce that to a single integer. It's very possible that the hash function returns the same result for two sets of different property values.

GetHashCode is not intended to represent a unique key.

Instead, it might be a better idea to actually generate a unique key for each line (using something like Guid). Or, perhaps, use the Id property that you seem to already have?

2 Comments

Unfortunately, this Id is more like a foreign key and is not unique: there may be many rows with the same Id, Author and Activity but, for example, a different DateTime.
That's no reason to use a hash. Add that information to the question itself.
