Python: Performing Modulo on Strings

Question

I have several billion strings in the format word0.word1.word2, and I wish to perform modulo n on those strings so that I can feed each to a database writer for storage. I know I can perform a form a modulo 10 on the first character of the strings like this:

for i in ["a.b","c.d"]: 
    print ord(i[0]) % 10

This won't divide my strings evenly, though, as word0, word1, and word2 are sorted into alphabetical order, and the first character of the string is very often "a". I could take the last letter of the string, but am not sure if those are normally distributed or not.

My question: Is there a fast way to perform something like "ord" on the entire string? I ultimately plan to run modulo 48 on the integer representations of the strings, and wish for that modular output to be uniformly distributed across all 48 cores. I would be grateful for any help others can offer.

So basically you are looking for a good hash function I think your best bet would be to take an existing implementation — Konstantin
– Konstantin, Commented Jul 20, 2015 at 12:10
@boardrider md5 is a cryptographic hash function, not the one you want to use in hashmap-like structures. — Konstantin
– Konstantin, Commented Jul 20, 2015 at 12:18

dlask · Accepted Answer · 2015-07-20 13:49:47Z

4

s = "whatever"  # have a string
h = hash(s)     # obtain its hash
bin = h % 48    # find the bin

Update: The Python's built-in hash function provides deterministic values only for a single process. If you want to keep this information (directly or indirectly ) in a database you have to use an explicit hash function that doesn't include any random data. (Credit goes to @Alik)

edited Jul 20, 2015 at 13:49

answered Jul 20, 2015 at 12:12

dlask

9,0421 gold badge30 silver badges31 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

duhaime Over a year ago

Thanks @dlask. I realized halfway through typing out the question that I am indeed after a good hash function. I'll accept this answer and fish around for the optimal implementation of hash()

dlask Over a year ago

This is not a template, it's a working code. The hash function is provided directly by Python. However, you can use another hash function, of course.

Konstantin Over a year ago

@duhaime please, read carefully documentation on __hash__ function. The most important part here is the last note. Also even if you use old Python3 versions or Python 2.7 __hash__ is not guaranteed to be a deterministic function.

duhaime Over a year ago

Thank you @Alik, this is essential. Do you know of a fast deterministic hash function that would help create a fairly uniform distribution appropriate for my %48 situation? Non-deterministic hashes are out of the question.

Konstantin Over a year ago

@duhaime all hashes must be deterministic functions. Cryptographic hash function is a hash function which is very hard to revert (i.e. get original input given hash value only). Usually they are slower compared to non-cryptographic hashes (see comparison table on xxHash's homepage - MD5, SHA1 are cryptographic hashes)

|

Collectives™ on Stack Overflow

Python: Performing Modulo on Strings

1 Answer 1

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related