4

I have several billion strings in the format word0.word1.word2, and I wish to perform modulo n on those strings so that I can feed each to a database writer for storage. I know I can perform a form a modulo 10 on the first character of the strings like this:

for i in ["a.b","c.d"]: 
    print ord(i[0]) % 10

This won't divide my strings evenly, though, as word0, word1, and word2 are sorted into alphabetical order, and the first character of the string is very often "a". I could take the last letter of the string, but am not sure if those are normally distributed or not.

My question: Is there a fast way to perform something like "ord" on the entire string? I ultimately plan to run modulo 48 on the integer representations of the strings, and wish for that modular output to be uniformly distributed across all 48 cores. I would be grateful for any help others can offer.

4
  • 1
    So basically you are looking for a good hash function I think your best bet would be to take an existing implementation Commented Jul 20, 2015 at 12:10
  • What should something_like_ord(whole_string) return? Commented Jul 20, 2015 at 12:12
  • stackoverflow.com/questions/5297448/… Commented Jul 20, 2015 at 12:14
  • @boardrider md5 is a cryptographic hash function, not the one you want to use in hashmap-like structures. Commented Jul 20, 2015 at 12:18

1 Answer 1

4
s = "whatever"  # have a string
h = hash(s)     # obtain its hash
bin = h % 48    # find the bin

Update: The Python's built-in hash function provides deterministic values only for a single process. If you want to keep this information (directly or indirectly ) in a database you have to use an explicit hash function that doesn't include any random data. (Credit goes to @Alik)

Sign up to request clarification or add additional context in comments.

8 Comments

Thanks @dlask. I realized halfway through typing out the question that I am indeed after a good hash function. I'll accept this answer and fish around for the optimal implementation of hash()
This is not a template, it's a working code. The hash function is provided directly by Python. However, you can use another hash function, of course.
@duhaime please, read carefully documentation on __hash__ function. The most important part here is the last note. Also even if you use old Python3 versions or Python 2.7 __hash__ is not guaranteed to be a deterministic function.
Thank you @Alik, this is essential. Do you know of a fast deterministic hash function that would help create a fairly uniform distribution appropriate for my %48 situation? Non-deterministic hashes are out of the question.
@duhaime all hashes must be deterministic functions. Cryptographic hash function is a hash function which is very hard to revert (i.e. get original input given hash value only). Usually they are slower compared to non-cryptographic hashes (see comparison table on xxHash's homepage - MD5, SHA1 are cryptographic hashes)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.