
So my professor just assigned this homework. I know my fair share of hashing techniques, but I have absolutely no idea how to avoid losing a lot of points to collisions, because 1 million strings will practically brute-force collisions into my hash table.

What should I focus on?

  1. Creating a really good re-hashing technique that detects when a collision occurs and re-hashes appropriately, or
  2. Focusing on how to convert the strings into unique integers so as to avoid collisions, using some kind of prime-number-based modulus.

Or maybe I'm just misunderstanding the assignment completely. How would you go about solving this? Any ideas would be really helpful.

  • I don't see that asking you for a hash table at all; just the hash function itself. Anyhow, for specific questions I'd ask the TA. Commented Apr 21, 2016 at 5:30
  • The assignment doesn't mention it, but the professor said it would be easier if we used unordered_map. Is it possible to actually edit the hashing function that unordered_map uses to solve this? Commented Apr 21, 2016 at 5:37
  • @SamPerales Yes (read some documentation), but a hash function has to return the same hash for the same argument. Collision detection is someone else's job. Commented Apr 21, 2016 at 5:51
  • @SamPerales: there's nothing in the assignment proper that needs unordered_map, but if you wanted to test your hash function before submission using your own set of random strings, a std::unordered_map<uint32_t, int> num_values; would let you count the number of strings hashing to a given value: for (auto& s : strings) ++num_values[hash(s)];. A loop over num_values can then look for counts >= 2, indicative of collisions. Commented Apr 21, 2016 at 6:41
  • That said, if your hash function were plugged into unordered_map and the strings were stored, you'd also get a number of collisions indicative of the hash function's quality, but compounded by the wrapping of the 32-bit hash space into the smaller bucket-index space (with the default max_load_factor of 1.0, you can expect >= 1,000,000 but < ~2,000,000 buckets). Commented Apr 21, 2016 at 6:43

2 Answers


The task is to create a hash function with zero collisions. TonyD just calculated the expected number of collisions to be about 116. According to the grading scheme, you will get zero points for a hash function with 116 collisions.

The professor gave a hint to use unordered_map, which doesn't help with designing hash functions. It may be a trick question...

How would you design a function which returns a repeatable, unique number for 1 million inputs?


2 Comments

I can't think of anything clever. The only thing I can resort to is allowing, say, 116 collisions, then going back and somehow making room for them using a second, different hash function.
@MarkusKull: "It may be a trick question..." - that would be pretty perverse, and of little educational value if hashing is the topic, but who knows....

Your teacher's asking you to hash 1 million strings and you have 2^32 = 4,294,967,296 distinct 32-bit integer values available.

With 20-character random strings, there are massively more possible strings than hash values, so you can't map specific strings onto specific hash values in a way that limits the collision potential. (If you had <= 2^32 potential strings, say because the string length was shorter or the values each character was allowed to take were restricted, you'd have a chance at a perfect hash function: a formula mapping each string to a known distinct number.)

So, you're basically left having to try to randomly but repeatably map from strings to hash values. The "Birthday Paradox" then kicks in, meaning you must expect quite a lot of collisions. How many? Well - this answer provides the formula - for m buckets (2^32) and n inserts (1,000,000):

expected collisions = n - m * (1 - ((m-1)/m)^n)

                    = 1,000,000 - 2^32 * (1 - ((2^32 - 1) / 2^32) ^ 1,000,000)

                    = 1,000,000 - 2^32 * (1 - 0.99976719645926983712557804052625)

                    ~= 1,000,000 - 999883.6

                    ~= 116.4

Put another way, the very best possible hash function would on average - for random string inputs - still have 116 collisions.

Your teacher says:

final score for you is max{0, 200 – 5*T}

So, there's no point doing the assignment: you're more likely to have a 24 carat gold meteor land in your front garden than get a positive score.

That said, if you want to achieve the lowest number of collisions in the class, a lowish-performance (not particularly cache-friendly) but minimal-collision option is simply to have an array of random data...

uint32_t data[20][256] = { ... };

Download some genuinely random data from an Internet site to populate it. Discard any duplicate numbers (in C++, you can use a std::set<> to find them). Index by character position (0..19), then by character value, generating your hash by XORing the looked-up values.


Illustration of collisions

If unconvinced by the information above, you can generate a million random 32-bit values - as if they were hashes of distinct strings - and see how often the hash values repeat. Any given run should produce output not too far from the 116 collision average calculated above.

#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;  // source of (ideally) non-deterministic 32-bit values

    // Count how many times each generated value occurs,
    // treating each value as the hash of a distinct string.
    std::map<unsigned, int> count;
    for (int i = 0; i < 1000000; ++i)
        ++count[rd()];

    // Histogram: how many hash values were produced by exactly N keys?
    std::map<int, int> histogram;
    for (auto& c : count)
        ++histogram[c.second];

    for (auto& h : histogram)
        std::cout << h.second << " hash values generated by " << h.first << " key(s)\n";
}

A few runs produced output...

$ ./poc
999752 hash values generated by 1 key(s)
124 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999796 hash values generated by 1 key(s)
102 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999784 hash values generated by 1 key(s)
108 hash values generated by 2 key(s)
$ ./poc
999744 hash values generated by 1 key(s)
128 hash values generated by 2 key(s)

4 Comments

Unfortunately I have no knowledge of the uint32_t data type, but you have been very helpful. I will follow what you have answered and try to implement the array approach. Is there anything else I should try to make my life easier?
uint32_t is an optional part of C++11 - see here. On most hardware, it's the same as unsigned int, so if your <cstdint> header lacks it, just typedef unsigned uint32_t or use unsigned (the int is implicit). I'm not sure how you imagine any quality hash algorithm could be easier than xoring some data from array lookups... I've never seen anything nearly as simple.
Oh I see, thank you. I have a feeling I'm going in the right direction here. I just have one question, is the array lookup specific to an array, or will other faster, linear data structures interfere with the algorithm?
@SamPerales: "will other faster, linear data structures interfere with the algorithm" - there is no data structure with faster indexed lookup than an array, so your question doesn't make sense.
