
So my professor just assigned this homework. I know my fair share of hashing techniques, but I have absolutely no idea how to avoid losing a lot of points to collisions, because 1 million strings will practically brute-force collisions into my hash table.

What should I focus on?

  1. Creating a really good re-hashing technique that detects when a collision occurs and re-hashes appropriately, or
  2. Focusing on how to convert the strings into unique integers so as to avoid collisions, using some kind of prime-number-based modulus.

Or maybe I'm just misunderstanding the assignment completely. How would you go about solving this? Any ideas would be really helpful.

  • I don't see that asking you for a hash table at all; just the hash function itself. Anyhow, for specific questions I'd ask the TA. Commented Apr 21, 2016 at 5:30
  • The assignment doesn't mention it, but the professor said it would be easier if we used unordered_map. Is it possible to actually edit the hashing function that unordered_map uses to solve this? Commented Apr 21, 2016 at 5:37
  • @SamPerales Yes (read some documentation), but a hash function has to return the same hash for the same argument. Collision detection is someone else's job. Commented Apr 21, 2016 at 5:51
  • @SamPerales: there's nothing in the assignment proper that needs unordered_map, but if you wanted to test your hash function before submission using your own set of random strings, a std::unordered_map<uint32_t, int> num_values; would let you count the number of strings hashing to a given value: for (auto& s : strings) ++num_values[hash(s)];. A loop over num_values can then look for counts >= 2, indicative of collisions. Commented Apr 21, 2016 at 6:41
  • That said, if your hash function were plugged into unordered_map and the strings were stored, you'd also get a number of collisions indicative of the hash function's quality, but compounded by the wrapping of the 32-bit hash space into the smaller bucket-index space (with the default max_load_factor of 1.0, you can expect >= 1,000,000 but < ~2,000,000 buckets). Commented Apr 21, 2016 at 6:43

2 Answers


The task is to create a hash function with zero collisions. TonyD just calculated the expected number of collisions to be about 116. According to the grading scheme, you will get zero points for a hash function with 116 collisions.

The professor gave a hint to use unordered_map, which doesn't help with designing hash functions. It may be a trick question...

How would you design a function which returns a repeatable, unique number for 1 million inputs?


2 Comments

I can't think of anything clever. The only thing I can resort to is allowing, say, 116 collisions, then going back and somehow making room for them using a second, different hash function.
@MarkusKull: "It may be a trick question..." - that would be pretty perverse, and of little educational value if hashing is the topic, but who knows....

Your teacher's asking you to hash 1 million strings and you have 2^32 = 4,294,967,296 distinct 32-bit integer values available.

With 20-character random strings, there are massively more possible strings than hash values, so you can't map specific strings onto specific hash values in a way that limits the collision potential. (If you had <= 2^32 potential strings, say because the string length was shorter or the values each character was allowed to take were restricted, you'd have a chance at a perfect hash function: a formula mapping each string to a known distinct number.)

So, you're basically left having to try to randomly but repeatably map from strings to hash values. The "Birthday Paradox" then kicks in, meaning you must expect quite a lot of collisions. How many? Well - this answer provides the formula - for m buckets (2^32) and n inserts (1,000,000):

expected collisions = n - m * (1 - ((m-1)/m)^n)

                    = 1,000,000 - 2^32 * (1 - ((2^32 - 1) / 2^32) ^ 1,000,000)

                    = 1,000,000 - 2^32 * (1 - 0.99976719645926983712557804052625)

                    ~= 1,000,000 - 999883.6

                    ~= 116.4

Put another way, the very best possible hash function would on average - for random string inputs - still have 116 collisions.

Your teacher says:

final score for you is max{0, 200 – 5*T}

So, there's no point doing the assignment: you're more likely to have a 24 carat gold meteor land in your front garden than get a positive score.

That said, if you want to achieve the lowest number of collisions in the class, a lowish-performance (not particularly cache-friendly) but minimal-collision option is simply to have an array of random data...

uint32_t data[20][256] = { ... };

Download some genuinely random data from an Internet site to populate it. Discard any duplicate numbers (in C++, you can use a std::set<> to find them). Index by character position (0..19), then by character value, generating your hash by XORing the looked-up values.


Illustration of collisions

If unconvinced by the information above, you can generate a million random 32-bit values - as if they were hashes of distinct strings - and see how often the hash values repeat. Any given run should produce output not too far from the 116 collision average calculated above.

#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;  // source of (ideally) non-deterministic 32-bit values

    // Count how many times each generated value occurs,
    // treating each value as the hash of a distinct string.
    std::map<unsigned, int> count;
    for (int i = 0; i < 1000000; ++i)
        ++count[rd()];

    // Histogram: how many hash values were produced by exactly N keys?
    std::map<int, int> histogram;
    for (auto& c : count)
        ++histogram[c.second];

    for (auto& h : histogram)
        std::cout << h.second << " hash values generated by " << h.first << " key(s)\n";
}

A few runs produced output...

$ ./poc
999752 hash values generated by 1 key(s)
124 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999796 hash values generated by 1 key(s)
102 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999784 hash values generated by 1 key(s)
108 hash values generated by 2 key(s)
$ ./poc
999744 hash values generated by 1 key(s)
128 hash values generated by 2 key(s)

4 Comments

Unfortunately I have no knowledge of the uint32_t data type, but you have been very helpful. I will follow what you have answered and try to implement the array approach. Is there anything else I should try to make my life easier?
uint32_t is an optional part of C++11 - see here. On most hardware, it's the same as unsigned int, so if your <cstdint> header lacks it, just typedef unsigned uint32_t or use unsigned (the int is implicit). I'm not sure how you imagine any quality hash algorithm could be easier than xoring some data from array lookups... I've never seen anything nearly as simple.
Oh I see, thank you. I have a feeling I'm going in the right direction here. I just have one question, is the array lookup specific to an array, or will other faster, linear data structures interfere with the algorithm?
@SamPerales: "will other faster, linear data structures interfere with the algorithm" - there is no data structure with faster indexed lookup than an array, so your question doesn't make sense.
