I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high-performance lookups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?
Comment: The following has a good set of general-purpose hash functions; you should try them against your data set, since some may outperform others on collisions: partow.net/programming/hashfunctions/index.html (Matthieu N., Oct 24, 2009)
Comment: Possible duplicate of Good Hash Function for Strings (M.J. Rayburn, Sep 2, 2017)
11 Answers
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
unsigned int hash(const char *s, unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101 + *s++;
    }
    return hash;
}
Comment: For s = "hello", the hashes of all prefixes of s are [104, 10605, 1071213, 108192621, 2337520240]. How do you quickly find the hash of the substring "hel"?

From some old code of mine:
/* magic numbers from http://www.isthe.com/chongo/tech/comp/fnv/ */
static const size_t InitialFNV = 2166136261U;
static const size_t FNVMultiple = 16777619;

/* Fowler / Noll / Vo (FNV) hash */
size_t myhash(const string &s)
{
    size_t hash = InitialFNV;
    for (size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ s[i];        /* xor the low 8 bits */
        hash = hash * FNVMultiple; /* multiply by the magic number */
    }
    return hash;
}
It's fast. Really freaking fast.
Comment: XOR on its own is commutative, so by itself it would make rearranged strings collide; the interleaved multiply is what breaks that symmetry. Note that 16777619 is in fact prime (it is the standard 32-bit FNV prime), and that xor-then-multiply makes this the FNV-1a variant.

That always depends on your data set. I for one had surprisingly good results using the CRC32 of the string. It works very well across a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
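For illustration, here is a minimal sketch that leans on zlib's crc32() (an assumption; any of those freestanding implementations is called in essentially the same way):

#include <string.h>
#include <zlib.h>

/* CRC32 of a C string via zlib: start from the initial CRC value,
   then feed it the bytes of the string. */
unsigned long crc32_of_string(const char *s)
{
    unsigned long crc = crc32(0L, Z_NULL, 0);  /* initial CRC value */
    return crc32(crc, (const unsigned char *)s, (unsigned int)strlen(s));
}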
Edit: Almost forgot: this page has a nice hash-function shootout with performance numbers and test data, further down the page: http://smallcode.weblogs.us/
Boost has a boost::hash library which provides basic hash functions for most common types.
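A minimal usage sketch (assuming Boost is available; boost::hash is the documented function object in the functional/hash header):

#include <boost/functional/hash.hpp>
#include <string>

/* boost::hash<T> is a function object; calling it returns a size_t. */
std::size_t hash_of(const std::string &s)
{
    boost::hash<std::string> hasher;
    return hasher(s);
}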
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
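A sketch of what a call into lookup3 looks like; hashlittle() is the function lookup3.c exposes for little-endian machines, and the seed value here is an arbitrary example:

#include <stdint.h>
#include <string.h>

/* Defined in lookup3.c; compile that file into your build. */
uint32_t hashlittle(const void *key, size_t length, uint32_t initval);

uint32_t jenkins_hash(const char *s)
{
    return hashlittle(s, strlen(s), 0x9747b28cU /* arbitrary seed */);
}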
If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common usage of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector.
If you're not hashing a fixed set of words, then obviously this doesn't apply.
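To make the idea concrete, here is a hand-rolled toy (not actual gperf output): for this particular four-keyword set, a trivial function happens to be collision-free, so a lookup is one array probe plus one strcmp. gperf searches for such functions automatically over much larger sets.

#include <string.h>

/* Toy perfect hash: for {"if", "else", "while", "return"} the function
   (first char + length) % 8 maps each keyword to a distinct slot. */
static const char *kw_table[8] = {
    "return", "else", NULL, "if", "while", NULL, NULL, NULL
};

int is_keyword(const char *s)
{
    unsigned h = ((unsigned char)s[0] + (unsigned)strlen(s)) % 8;
    return kw_table[h] != NULL && strcmp(kw_table[h], s) == 0;
}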
One classic suggestion for a string hash is to step through the letters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (allowing the hash value to overflow):
template <class T> struct myhash;

template <> struct myhash<string>
{
    size_t operator()(const string &to_hash) const
    {
        const char *in = to_hash.c_str();
        size_t out = 0;
        while (*in != '\0')
        {
            out *= 53; // just a prime number
            out += *in;
            ++in;
        }
        return out;
    }
};

hash_map<string, int, myhash<string> > my_hash_map;
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster still.

If the hash gets calculated too often, you might try caching the hash value by creating a string type that remembers its own hash, along the lines of the sketch below. hash_map should be doing that internally, though.
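A sketch of that caching idea (the class name is illustrative, and composition is used here instead of subclassing basic_string, which is generally safer):

#include <cstddef>
#include <string>

/* Hypothetical wrapper: stores the string together with its hash,
   computed once at construction and then reused. */
class HashedString
{
    std::string s_;
    std::size_t hash_;

    static std::size_t compute(const std::string &s)
    {
        std::size_t out = 0;
        for (std::string::size_type i = 0; i < s.size(); ++i)
            out = out * 53 + static_cast<unsigned char>(s[i]); /* same prime loop as above */
        return out;
    }

public:
    explicit HashedString(const std::string &s) : s_(s), hash_(compute(s)) {}
    std::size_t hash() const { return hash_; }    /* cached: O(1) */
    const std::string &str() const { return s_; }
};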
I did a little searching and, funny thing, Paul Larson's little algorithm showed up at http://www.strchr.com/hash_functions as having the fewest collisions of anything tested under a number of conditions, and it's very fast for one that isn't unrolled or table-driven.

Larson's is the simple multiply-by-101-and-add loop above.
From "Hash Functions all the way down":
MurmurHash got quite popular, at least in game developer circles, as a “general hash function”.
It’s a fine choice, but let’s see later if we can generally do better. Another fine choice, especially if you know more about your data than “it’s gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun’s replies, or Rune’s modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!
Without more information I would recommend MurmurHash as a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2 and FNV are very good.
Here (data sizes < 10 bytes) we can see that the ILP smartness of other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 win in performance.
djb2
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
FNV-1
hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash × FNV_prime
    hash = hash XOR byte_of_data
return hash
FNV-1a

hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash XOR byte_of_data
    hash = hash × FNV_prime
return hash
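Translating the FNV-1a pseudocode into 32-bit C with the standard constants (the same magic numbers quoted in the earlier answer):

#include <stddef.h>
#include <stdint.h>

uint32_t fnv1a(const unsigned char *data, size_t len)
{
    uint32_t hash = 2166136261u;   /* FNV offset basis */
    size_t i;
    for (i = 0; i < len; i++)
    {
        hash ^= data[i];           /* xor the byte in first ... */
        hash *= 16777619u;         /* ... then multiply by the FNV prime */
    }
    return hash;
}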
A note about security and availability
Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.
Some hash functions like MurmurHash accept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.
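A sketch of what seeding looks like, assuming the reference MurmurHash3 from Austin Appleby's smhasher is compiled into the build (the seeding scheme here is purely illustrative; a real CSPRNG is preferable to rand()):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "MurmurHash3.h"   /* from smhasher; provides MurmurHash3_x86_32() */

static uint32_t g_seed;    /* unpredictable per-process seed */

void init_hash_seed(void)
{
    srand((unsigned)time(NULL));   /* illustration only; use a CSPRNG in production */
    g_seed = (uint32_t)rand();
}

uint32_t seeded_hash(const char *s)
{
    uint32_t out;
    MurmurHash3_x86_32(s, (int)strlen(s), g_seed, &out);
    return out;
}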