284

What's a correct and good way to implement __hash__()?

I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.

As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions). What's a good practice to get such values? Are collisions a problem? In my case I have a small class which acts as a container class holding some ints, some floats and a string.

9 Answers 9

317

An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.

Here's an example of using a key for hash and equality:

class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented

Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.

Sign up to request clarification or add additional context in comments.

7 Comments

Aside from the minor overhead from factoring out the __key function, this is about as fast as any hash can be. Sure, if the attributes are known to be integers, and there aren't too many of them, I suppose you could potentially run slightly faster with some home-rolled hash, but it likely wouldn't be as well distributed. hash((self.attr_a, self.attr_b, self.attr_c)) is going to be surprisingly fast (and correct), as creation of small tuples is specially optimized, and it pushes the work of getting and combining hashes to C builtins, which is typically faster than Python level code.
Let's say an object of class A is being used as a key for a dictionary and if an attribute of class A changes, its hash value will change as well. Wouldn't that create a problem ?
As @loved.by.Jesus's answer below mentions, hash method should not be defined/overridden for a mutable object (defined by default and uses id for equality and comparison).
@Miguel, I ran into the exact problem, what happens is the dictionary returns None once the key changes. The way I solved it was by storing the id of the object as a key instead of just the object.
@JaswantP Python by default uses id of the object as the key for any hashable object.
|
37

John Millikin proposed a solution similar to this:

class A(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))

The problem with this solution is that the hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?

Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states

The only required property is that objects which compare equal have the same hash value

Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.

Outdated/bad solution

The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:

class B(object):

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))

Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).

7 Comments

You usually don't want to do a straight XOR the attributes together, as that will give you collisions if you change the order of the values. That is, hash(A(1, 2, 3)) will be equal to hash(A(3, 1, 2)) (and they'll both hash equal to any other A instance with a permutation of 1, 2 and 3 as its values). If you want to avoid your instance having the same hash as a tuple of their arguments, simply create a sentinel value (as either a class variable, or as a global) then include it in the tuple to be hashed: return hash((_sentinel, self._a, self._b, self._c))
Your use of isinstance could be problematic, since an object of a subclass of type(self) can now be equal to an object of type(self). So you may find that adding a Car and a Ford to a set() may result in only one object inserted, depending on the order of insertion. Additionally, you may run into a situation where a == b is True but b == a is False.
If you're subclassing B, you may want to change that to isinstance(othr, B)
A thought: the key tuple could include the class type, which would prevent other classes with the same key set of attributes from being shown to be equal: hash((type(self), self._a, self._b, self._c)).
Besides the point about using B instead of type(self), it's also often considered better practice to return NotImplemented when encountering an unexpected type in __eq__ instead of False. That allows other user-defined types to implement an __eq__ that knows about B and can compare equal to it, if they wish to.
|
17

Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that

for c in some_string:
    hash = 101 * hash  +  ord(c)

worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.

5 Comments

Apparently Java does it the same way but using 31 instead of 101
What's the rationale behind using these numbers? Is there a reason to pick 101, or 31?
Here's an explanation for prime multipliers: stackoverflow.com/questions/3613102/…. 101 seems to work particularly well, based on Paul Larson's experiments.
Python uses (hash * 1000003) XOR ord(c) for strings with 32-bit wraparound multiplication. [Citation]
Even if this is true it is of no practical use in this context because the builtin Python string types already provide a __hash__ method; we don't need to roll our own. The question is how to implement __hash__ for a typical user-defined class (with a bunch of properties pointing to built-in types or perhaps to other such user-defined classes), which this answer doesn't address at all.
8

A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:

class A:

    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c

    def __hash__(self):
        return hash(tuple(self))

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))

(here __eq__ is not required for hash, but it's easy to implement).

Now add some mutable members to see how it works:

obj = A(2, 2.2, "cat")
hash(obj)  # 977171927842460677
dict(obj)  # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(obj)  # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(obj) # (('a', 2), ('b', 2.2), ('c', 'cat'))
assert obj == A(2, 2.2, "cat")
assert obj != A(2, 2.2, "cats")

things only fall apart if you try to put non-hashable members in the object model:

hash(A(2, 2.2, [1]))  # TypeError: unhashable type: 'list'

Comments

3

I can try to answer the second part of your question.

The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.

In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!

Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try and branch different colliding entries out so that entries to cluster in one particular spot.

Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.

In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.

Here are some links if you're interested more:

coalesced hashing on wikipedia

Wikipedia also has a summary of various collision resolution methods:

Also, "File Organization And Processing" by Tharp covers alot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.

Comments

3

A very good explanation on when and how implement the __hash__ function is on programiz website:

Just a screenshot to provide an overview: (Retrieved 2019-12-13)

Screenshot of https://www.programiz.com/python-programming/methods/built-in/hash 2019-12-13

As for a personal implementation of the method, the above mentioned site provides an example that matches the answer of millerdev.

class Person:
def __init__(self, age, name):
    self.age = age
    self.name = name

def __eq__(self, other):
    return self.age == other.age and self.name == other.name

def __hash__(self):
    print('The hash is:')
    return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))

Comments

3

@dataclass(frozen=True) (Python 3.7)

This awesome new feature automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets without any need for tedious repetition:

dataclass_cheat.py

from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyClass1:
    n: int
    s: str

@dataclass(frozen=True)
class MyClass2:
    n: int
    my_class_1: MyClass1

d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6

assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6

# Due to `frozen=True` we can't modify objects.
o = MyClass1(n=1, s='a')
try:
    o.n = 2
except FrozenInstanceError as e:
    pass
else:
    raise 'error'

As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:

d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1

works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.

frozen=True is mandatory, otherwise the class is not hashable, otherwise it would make it possible for users to inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html

@dataclass also provides other good things like __str__, it's awesome.

Tested on Python 3.10.7, Ubuntu 22.10.

2 Comments

what overhead will decorating with dataclass add?
@NaveenReddyMarthala as opposed to manually defining __init__, __hash__, __eq__? Basically none, as it produces the same class I believe, except one single function call to create the class itself, which normally only runs once.
0

Depends on the size of the hash value you return. It's simple logic that if you need to return a 32bit int based on the hash of four 32bit ints, you're gonna get collisions.

I would favor bit operations. Like, the following C pseudo code:

int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0 | (d & 0x000F000F);

Such a system could work for floats too, if you simply took them as their bit value rather than actually representing a floating-point value, maybe better.

For strings, I've got little/no idea.

1 Comment

I know that there will be collisions. But I have no clue how these are handled. And furhermore my attribute values in combination are very sparsely distributed so I was looking for a smart solution. And somehow I expected there to be a best practice somewhere.
-1

I will contend that because of the definition of __hash__ as it is known today you necessarily cannot have an implementation that is both "correct" -and- "good" at the same time. So instead, let's settle on "good" (since the accepted answer instead aims to be "correct"):

class Thing:
    def __hash__(self):
        return id(self)

    def __eq__(self, other):
        # it is advised to implement __eq__ when you implement __hash__
        # however this __eq__ impl has nothing to do with this being a 
        # "good" implementation of __hash__, it merely ensures correctness
        # of behavior for use in container types. you could obviously write
        # something more tailored to your class than this, this is merely
        # inheritance-friendly.
        if not isinstance(other, type(self)):            
            return False
        for aname in (set(dir(self)) | set (dir(other))):
            if aname.startswith('__'):
                continue
            if (not hasattr(self, aname)) or (not hasattr(other, aname)):
                return False
            atype = type(self.attr)
            if atype is types.MethodType or atype is types.FunctionType:
                continue
            if getattr(self, aname) != getattr(other, aname):
                return False
        return True

For this implementation __hash__ yields a value that is unique over the entire process for the lifetime of the object. For objects stored to a list or set (where reachability prevents object destruction) this is acceptable.

When used as a key for a dict, if the object is no longer reachable and is destroyed this key "may" collide. We must also consider that __hash__ results from other objects "may" collide as well due to differences in implementations. For these reasons implementations of containers typically check for object equality (both in Python and in other languages, as the notion of object hashes for container identity is not a uniquely Python concept.)

There are ways of defeating the problem of collision without depending on __eq__ but that is best reserved for another Q&A, and really deserves the consideration of container devs not the developers using their containers.

In any case, this implementation has the value of efficiency and simplicity, and does not suffer from collisions resulted from relying on the __hash__ of "correct" implementations such as that of Tuple. It is ideal that object identity is firstly resolved by __hash__ result comparison avoiding an additional dispatch for a __eq__ result.

To demonstrate this implementation, consider the following subclass and-also a unittest class for verification:

class SubThing(Thing):
    def __init__(self, val1, val2):
        if val1 != None:
            self.val1 = val1
        if val2 != None:
            self.val2 = val2

import unittest
class ThingTests(unittest.TestCase):
    def test_Things(self):
        t1 = Thing()
        t2 = Thing()
        self.assertEqual(t1, t2)
        set1 = set()
        set1.add(t1)
        set1.add(t2)
        set1.add(t1)
        set1.add(t2)
        self.assertEqual(2, len(set1))
        t3 = SubThing(1,2)
        t4 = SubThing(1,2)
        t5 = SubThing(2,3)
        t6 = SubThing(None,None)
        t7 = SubThing(None,5)
        t8 = SubThing(5,None)
        set2 = set()
        set2.add(t3)
        set2.add(t4)
        set2.add(t5)
        set2.add(t6)
        set2.add(t7)
        set2.add(t8)
        self.assertEqual(6, len(set2))
        self.assertEqual(t1, t2)
        self.assertEqual(t3, t4)
        self.assertNotEqual(t1, t3)
        self.assertNotEqual(t3, t4)
        self.assertNotEqual(t4, t5)
        self.assertNotEqual(t5, t6)
        self.assertNotEqual(t6, t7)
        self.assertNotEqual(t7, t8)

Test passes.

Conclusion

This implementation is "good" but it is not "correct", why?

Because the current Python spec defines __hash__ with the constraint that two objects which would return true in an equality test must also return the same hash value. Bad. This will likely plague Python until the end of time. That doesn't mean you need to write naive code. The above will comply with all but the worst container implementations (those which foolishly rely on object equality instead of object hashes), and if there is a container implementation that misbehaves because of the above __hash__ then that impl it deserves to be called out and fixed.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.