
I am using an extendible hash and I want to have strings as keys. The problem is that the current hash function I am using iterates over the whole string/key, and I think this is pretty bad for the program's performance, since the hash function is called many times, especially when I am splitting buckets.

Current hash function

#include <string>

// Multiplicative string hash: every character of the key feeds the result.
unsigned long hash(const std::string& key)
{
    const unsigned long seed = 131;
    unsigned long hash = 0;
    for (std::size_t i = 0; i < key.length(); i++)
    {
        // Cast so that negative char values don't perturb the hash.
        hash = (hash * seed) + static_cast<unsigned char>(key[i]);
    }
    return hash;
}

The keys could be as long as 40 characters.

Example of string/key

string key = "from-to condition";

I have searched the internet for a better one, but I didn't find anything that matches my case. Any suggestions?

  • 40 chars isn't that long. And the current hash function you use doesn't have any advantages over those in the std libs. Last time I checked, MSVC was using FNV-1a for strings, which shouldn't be any slower than yours but much better for hashing. I didn't check, but GCC probably uses the same. Commented Dec 18, 2015 at 10:11
  • Your hash is bad. It can easily overflow, and integer overflow is undefined behavior. Commented Dec 18, 2015 at 10:13
  • No, it can't (well, unless ULONG_MAX == INT_MAX, which is a pretty pathological implementation; in fact, I don't think it's legal). The calculation is done in unsigned long, for which overflow is perfectly defined. Commented Dec 18, 2015 at 10:18
  • Well, I found it on the internet. I guess I'll have to find another one. Commented Dec 18, 2015 at 10:18
  • @UmNyobe The fact that his hash is bad has nothing to do with overflow. As already said, unsigned overflow is perfectly defined. Almost all hashing algorithms use overflows. Commented Dec 18, 2015 at 10:20

3 Answers


I am using an extendible hash and I want to have strings as keys.

As mentioned before, use std::hash until there is a good reason not to.

The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad...

It's an understandable thought, but is actually unlikely to be a real concern.

(anticipating) why?

A quick scan over Stack Overflow will reveal many experienced developers talking about caches and cache lines.

(forgive me if I'm teaching my grandmother to suck eggs)

A modern CPU is incredibly quick at processing instructions and performing (very complex) arithmetic. In almost all cases, what limits its performance is having to talk to memory across a bus, which is by comparison, horribly slow.

So chip designers build in memory caches - extremely fast memory that sits in the CPU (and therefore does not have to be communicated with over a slow bus). Unhappily, there is only so much space available [plus heat constraints - a topic for another day] for this cache memory so the CPU must treat it like an OS does a disk cache, flushing memory and reading in memory as and when it needs to.

As mentioned, communicating across the bus is slow - (simply put) it requires all the electronic components on the motherboard to stop and synchronise with each other. This wastes a horrible amount of time [This would be a fantastic point to go into a discussion about the propagation of electronic signals across the motherboard being constrained by approximately half the speed of light - it's fascinating but there's only so much space here and I have only so much time]. So rather than transfer one byte, word or longword at a time, the memory is accessed in chunks - called cache lines.

It turns out that this is a good decision by chip designers because they understand that most memory is accessed sequentially - because most programs spend most of their time accessing memory linearly (such as when calculating a hash, comparing strings or objects, transforming sequences, copying and initialising sequences and so on).

What's the upshot of all this?

Well, bizarrely, if your string is not already in cache, it turns out that reading one byte of it is almost exactly as expensive as reading (say) the first 128 bytes of it.

Plus, because the cache circuitry assumes that memory access is linear, it will begin a fetch of the next cache line as soon as it has fetched your first one. It will do this while your CPU is performing its hash computation.

I hope you can see that in this case, even if your string were many thousands of bytes long and you chose to hash only (say) every 128th byte, all you would be doing is computing a very much inferior hash while still causing the memory cache to halt the processor while it fetched large chunks of unused memory. It would take just as long - for a worse result!
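To make that concrete, here is a sketch (not from the answer, purely an illustration) of what such a "sample every 128th byte" hash might look like, reusing the question's multiplicative scheme. As argued above, this is the approach to avoid: on short strings it saves next to nothing, and it weakens the hash considerably.

#include <string>

// Illustration only: hash a sample of the key by striding 128 bytes
// at a time. The cache fetches whole lines anyway, so this saves
// little time while producing far more collisions than hashing
// every byte.
unsigned long strided_hash(const std::string& key)
{
    const unsigned long seed = 131;
    unsigned long h = 0;
    for (std::size_t i = 0; i < key.length(); i += 128)
    {
        h = (h * seed) + static_cast<unsigned char>(key[i]);
    }
    return h;
}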

Having said that, what are good reasons not to use the standard implementation?

Only when:

  1. The users are complaining that your software is too slow to be useful, and

  2. The program is verifiably CPU-bound (using 100% of CPU time), and

  3. The program is not wasting any cycles by spinning, and

  4. Careful profiling has revealed that the program's biggest bottleneck is the hash function, and

  5. Independent analysis by another experienced developer confirms that there is no way to improve the algorithm (for example by calling hash less often; see the sketch after this list).

In short, almost never.
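As an illustration of point 5, here is a minimal sketch of one way to call hash less often in an extendible hash: compute the hash once at insertion and store it with the entry, so that bucket splits re-bucket entries from the cached value instead of re-reading the key. The Entry type and field names are hypothetical, not part of any particular library.

#include <functional>
#include <string>

// Hypothetical entry type: the hash is computed once and reused on
// every split, so the key's bytes are read only once.
struct Entry
{
    std::string key;
    std::size_t cached_hash;
};

Entry make_entry(std::string key)
{
    const std::size_t h = std::hash<std::string>{}(key);
    return Entry{std::move(key), h};
}

// On a split, the new bucket index comes from the cached value, e.g.:
//   index = entry.cached_hash & ((std::size_t{1} << global_depth) - 1);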


1 Comment

Note that using 100% of CPU time doesn't mean the program is CPU-bound. When the CPU stalls for cache or memory access, it will still be counted as CPU time. You need a deeper analysis than the "%" to distinguish CPU-bound and memory-bound.

You should prefer to use std::hash unless measurement shows that you can do better. To limit the number of characters your own hash function reads, use something like:

    const auto limit = std::min<std::size_t>(key.length(), 16);
    for (std::size_t i = 0; i < limit; i++)

You will want to experiment to find the best value to use in place of the 16.
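Put together, a complete version of that idea might look like the sketch below, reusing the question's multiplicative hash but stopping after limit characters (16 here is just the example value above):

#include <algorithm>
#include <string>

// Hash at most the first `limit` characters of the key.
unsigned long prefix_hash(const std::string& key)
{
    const unsigned long seed = 131;
    const std::size_t limit = std::min<std::size_t>(key.length(), 16);
    unsigned long h = 0;
    for (std::size_t i = 0; i < limit; i++)
    {
        h = (h * seed) + static_cast<unsigned char>(key[i]);
    }
    return h;
}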

I would actually expect the performance to get worse (because you will have more collisions). If your strings were several kilobytes long, then limiting the hash to the first 64 bytes might well be worthwhile.

Depending on your strings, it might be worth not starting at the beginning. For example, when hashing filenames you would probably do better using the characters from 20 to 5 before the end (ignoring the often-constant pathname prefix and the file extension). But you still have to measure.
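For the filename case, a sketch of that windowed idea could look like this (the 20 and 5 offsets are just the example values above; the clamping handles short keys):

#include <string>

// Hash only the characters from 20 before the end to 5 before the
// end, skipping a near-constant path prefix and the file extension.
unsigned long window_hash(const std::string& key)
{
    const unsigned long seed = 131;
    const std::size_t end   = key.length() > 5  ? key.length() - 5  : key.length();
    const std::size_t start = key.length() > 20 ? key.length() - 20 : 0;
    unsigned long h = 0;
    for (std::size_t i = start; i < end; i++)
    {
        h = (h * seed) + static_cast<unsigned char>(key[i]);
    }
    return h;
}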

Comments


You can directly use std::hash instead of implementing your own function.

#include <iostream>
#include <functional>
#include <string>

size_t hash(const std::string& key)
{
    std::hash<std::string> hasher;
    return hasher(key);
}

int main() {
    std::cout << hash("abc") << std::endl;
    return 0;
}

See this code here: https://ideone.com/4U89aU

3 Comments

That's a great point of departure, but this is implementation-defined, and if it's too slow, some tips on how to write a fast(er) hash function might be sensible...?
@TonyDelroy I am sure there is a huge amount of computer science work, PhDs even, on finding faster hash functions, and each author believes theirs is the best. You would need a whole separate website with test cases and discussions at length, which would take years to mature... or you can just use std::hash
@BoppityBop: all true, but this specific question asks for a string hash function that doesn't iterate over the entire string (..."over the whole string/key and I think that this is pretty bad for the program's performance"). Using std::hash<> is abdicating control of that implementation aspect. It so happens that GCC and clang's implementations will iterate over the entire string, and Microsoft Visual C++'s Standard Library will not (but what it does do - picking 10 characters evenly spaced throughout the string - isn't helpful for ~40 character strings). Seems you're saying the Q is too hard?
