I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high-performance lookups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?
Comment: The following has a good set of general-purpose hash functions; you should try them against your data set, since some may outperform others on collisions: partow.net/programming/hashfunctions/index.html (Matthieu N., Oct 24, 2009)
Comment: Possible duplicate of Good Hash Function for Strings (M.J. Rayburn, Sep 2, 2017)
11 Answers
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
unsigned int hash(const char *s, unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101 + *s++;
    }
    return hash;
}
Comment: For s = "hello", the hashes of all prefixes of s are [104, 10605, 1071213, 108192621, 2337520240]. How do you quickly find the hash of the substring "hel"?

From some old code of mine:
/* magic numbers from http://www.isthe.com/chongo/tech/comp/fnv/ */
static const size_t InitialFNV = 2166136261U;
static const size_t FNVMultiple = 16777619;

/* Fowler / Noll / Vo (FNV) hash */
size_t myhash(const string &s)
{
    size_t hash = InitialFNV;
    for (size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ s[i];        /* xor the low 8 bits */
        hash = hash * FNVMultiple; /* multiply by the magic number */
    }
    return hash;
}
It's fast. Really freaking fast.
Comment: XOR on its own is commutative, so by itself it would make rearranged strings collide; the interleaved multiply is what breaks that symmetry. Note that 16777619 is in fact prime (it is the standard 32-bit FNV prime), and that xor-then-multiply makes this the FNV-1a variant.

That always depends on your data set. I for one had surprisingly good results using the CRC32 of the string. It works very well across a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
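For illustration, here is a minimal sketch that leans on zlib's crc32() (an assumption; any of those freestanding implementations is called in essentially the same way):

#include <string.h>
#include <zlib.h>

/* CRC32 of a C string via zlib: start from the initial CRC value,
   then feed it the bytes of the string. */
unsigned long crc32_of_string(const char *s)
{
    unsigned long crc = crc32(0L, Z_NULL, 0);  /* initial CRC value */
    return crc32(crc, (const unsigned char *)s, (unsigned int)strlen(s));
}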
Edit: Almost forgot: this page has a nice hash-function shootout with performance numbers and test data, further down the page: http://smallcode.weblogs.us/
Boost has a boost::hash library which provides basic hash functions for most common types.
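A minimal usage sketch (assuming Boost is available; boost::hash is the documented function object in the functional/hash header):

#include <boost/functional/hash.hpp>
#include <string>

/* boost::hash<T> is a function object; calling it returns a size_t. */
std::size_t hash_of(const std::string &s)
{
    boost::hash<std::string> hasher;
    return hasher(s);
}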
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
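A sketch of what a call into lookup3 looks like; hashlittle() is the function lookup3.c exposes for little-endian machines, and the seed value here is an arbitrary example:

#include <stdint.h>
#include <string.h>

/* Defined in lookup3.c; compile that file into your build. */
uint32_t hashlittle(const void *key, size_t length, uint32_t initval);

uint32_t jenkins_hash(const char *s)
{
    return hashlittle(s, strlen(s), 0x9747b28cU /* arbitrary seed */);
}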
If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common usage of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector.
If you're not hashing a fixed set of words, then obviously this doesn't apply.
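To make the idea concrete, here is a hand-rolled toy (not actual gperf output): for this particular four-keyword set, a trivial function happens to be collision-free, so a lookup is one array probe plus one strcmp. gperf searches for such functions automatically over much larger sets.

#include <string.h>

/* Toy perfect hash: for {"if", "else", "while", "return"} the function
   (first char + length) % 8 maps each keyword to a distinct slot. */
static const char *kw_table[8] = {
    "return", "else", NULL, "if", "while", NULL, NULL, NULL
};

int is_keyword(const char *s)
{
    unsigned h = ((unsigned char)s[0] + (unsigned)strlen(s)) % 8;
    return kw_table[h] != NULL && strcmp(kw_table[h], s) == 0;
}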
One classic suggestion for a string hash is to step through the letters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (allowing the hash value to overflow):
template <class T> struct myhash;

template <> struct myhash<string>
{
    size_t operator()(const string &to_hash) const
    {
        const char *in = to_hash.c_str();
        size_t out = 0;
        while (*in != '\0')
        {
            out *= 53; // just a prime number
            out += *in;
            ++in;
        }
        return out;
    }
};

hash_map<string, int, myhash<string> > my_hash_map;
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster still.

If the hash gets calculated too often, you might try caching the hash value by creating a string type that remembers its own hash, along the lines of the sketch below. hash_map should be doing that internally, though.
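A sketch of that caching idea (the class name is illustrative, and composition is used here instead of subclassing basic_string, which is generally safer):

#include <cstddef>
#include <string>

/* Hypothetical wrapper: stores the string together with its hash,
   computed once at construction and then reused. */
class HashedString
{
    std::string s_;
    std::size_t hash_;

    static std::size_t compute(const std::string &s)
    {
        std::size_t out = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i)
            out = out * 53 + static_cast<unsigned char>(s[i]); /* same prime loop as above */
        return out;
    }

public:
    explicit HashedString(const std::string &s) : s_(s), hash_(compute(s)) {}
    std::size_t hash() const { return hash_; }    /* cached: O(1) */
    const std::string &str() const { return s_; }
};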
I did a little searching and, funny thing, Paul Larson's little algorithm showed up at http://www.strchr.com/hash_functions as having the fewest collisions of anything tested under a number of conditions, and it's very fast for one that isn't unrolled or table-driven.

Larson's is the simple multiply-by-101-and-add loop above.
From "Hash Functions all the way down":
MurmurHash got quite popular, at least in game developer circles, as a “general hash function”.
It’s a fine choice, but let’s see later if we can generally do better. Another fine choice, especially if you know more about your data than “it’s gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun’s replies, or Rune’s modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!
Without more information I would recommend MurmurHash as a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2 and FNV are very good.
Here (data sizes < 10 bytes) we can see that the ILP smartness of other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 win in performance.
djb2
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
FNV-1
hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash × FNV_prime
    hash = hash XOR byte_of_data
return hash
FNV-1a

hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash XOR byte_of_data
    hash = hash × FNV_prime
return hash
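Translating the FNV-1a pseudocode into 32-bit C with the standard constants (the same magic numbers quoted in the earlier answer):

#include <stddef.h>
#include <stdint.h>

uint32_t fnv1a(const unsigned char *data, size_t len)
{
    uint32_t hash = 2166136261u;   /* FNV offset basis */
    size_t i;
    for (i = 0; i < len; i++)
    {
        hash ^= data[i];           /* xor the byte in first ... */
        hash *= 16777619u;         /* ... then multiply by the FNV prime */
    }
    return hash;
}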
A note about security and availability
Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.
Some hash functions like MurmurHash accept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.
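A sketch of what seeding looks like, assuming the reference MurmurHash3 from Austin Appleby's smhasher is compiled into the build (the seeding scheme here is purely illustrative; a real CSPRNG is preferable to rand()):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "MurmurHash3.h"   /* from smhasher; provides MurmurHash3_x86_32() */

static uint32_t g_seed;    /* unpredictable per-process seed */

void init_hash_seed(void)
{
    srand((unsigned)time(NULL));   /* illustration only; use a CSPRNG in production */
    g_seed = (uint32_t)rand();
}

uint32_t seeded_hash(const char *s)
{
    uint32_t out;
    MurmurHash3_x86_32(s, (int)strlen(s), g_seed, &out);
    return out;
}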