2

I am trying to implement a randomized algorithm called tidemark. It is described in Unit 2 of these lecture notes. To do this I need a randomly chosen hash function that maps the integers [1,...,n] to [1,...,n]. Python has a few different hash function in libraries but I didn't find one that would enable me to specify the domain and range and would choose a suitable function at random.

Does such a thing exist?

2
  • 1
    The notes you linked use a "hash function" in specific sense, namely a function from a 2-universal hash family. If you don't know what that means, it may be best to follow what the paper says before proceeding: "If you are unfamiliar with the concept [of a 2-universal hash family], working through Exercises 2-1 and 2-2 is strongly recommended." Those exercise describe exactly what sort of function you need here. A randomly chosen function may not be adequate. Commented Apr 20, 2021 at 14:22
  • @Brian I do know that that means. I just wanted an easy way to implement it in python without reinventing the wheel. Commented Apr 20, 2021 at 14:23

3 Answers 3

3

Well, off the top of my head, I'd piggyback off Python's hash() but twist the numbers it returns using a random number. (Note that hash() doesn't have a deterministic value between Python interpreter restarts either.)

import secrets

def gen_random_hasher(max_val=1024):
    seed = secrets.randbits(64)
    return lambda val: (hash(val) ^ seed) % max_val

s1 = gen_random_hasher()
s2 = gen_random_hasher()

print(s1('aaa'), s1(123))
assert s1('aaa') == s1('aaa')  # "prove" the function is deterministic
print(s2('aaa'), s2(123))

This prints out e.g.

447 885
55 765

so you can tell s1 and s2 are different.

Sign up to request clarification or add additional context in comments.

3 Comments

(hash(val)^seed)%max_val does not generate independent hash functions if max_val is a power of two, which could be a common occurrence. If val is an integer you could do hash(val^seed)%max_val instead provided that hash() is a good hash function, which it is not in my installation, where it seems to be the identity function.
@WolfgangBrehm hash() is an identity function for (small enough) integers, but not for anything else: github.com/python/cpython/blob/…
That could still be a problem for OP, as he wants to hash integers.
1

What you are looking for is a universal hash function. Many hashing algorithms use a random number internally that can be used to generalize them to a universal hash function.

You can also use any integer hash function and then multiply with a random integer larger than 0 before the modulus reduction.

Depending on how well the values before hashing are distributed and how well the hash values need to be distributed this is probably entirely sufficient:

from functools import partial
from random import randint

def hash_family(value,n,salt):             # output is [0,n) (i.e. excluding n)
  value=value*(2*salt+1)                   # multiplying with 0 is bad
  value=value^(value>>(n.bit_length()//2)) # xorshift to improve distribution
  value=value&((1<<n.bit_length())-1)      # cut off high bits for speed
  value=value*(2*salt+1)                   # another multiplication
  value=value^(value>>(n.bit_length()//2)) # another xorshift
  value=value&((1<<n.bit_length())-1)      # cut off high bits
  value=value%n                            # modulus reduction
  return value

random_hash = partial(hash_family,salt=randint(0,2**64))
>>> hash_family(23,100,56)
73
>>> hash_family(23,100,57)
52
>>> hash_family(23,100,58)
30
>>> random_hash(17,100)
42

For cryptographic strength use pythons hashlib, many of the hash functions there have a salt parameter that can be set with randint and bound using partial to use it as a universal hash function. Range reduction can be done with the modulo operation.

Comments

0

Use standard md5 hash and use os.urandom as the random seed, and the hash function can be reused through the seed

def get_random_hashfunc(_max=1024, seed=None):
    seed = seed or os.urandom(10)
    seed_hash_func = hashlib.md5(seed)
    def hashfunc(n):
        func = seed_hash_func.copy()
        func.update(n.to_bytes(n.bit_length(), 'big'))
        return int.from_bytes(func.digest(), 'big') % _max
    return hashfunc, seed

hash_func1, seed1 = get_random_hashfunc()
hash_func2, seed2 = get_random_hashfunc()
hash_func3, seed3 = get_random_hashfunc(seed=seed1)

>>> hash_func1(123)
156
>>> hash_func2(123)
931
>>> hash_func3(123)
156

1 Comment

I guess _max has to be a power of two for this to work properly?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.