Your teacher's asking you to hash 1 million strings and you have 2^32 = 4,294,967,296 distinct 32-bit integer values available.
With 20-character random strings, there are massively more possible strings than hash values, so you can't map specific strings onto specific hash values in a way that limits the collision potential. (If there were <= 2^32 potential strings - say, because the string length was shorter, or the values each character was allowed to take were restricted - you'd have a chance at a perfect hash function: a formula mapping each string to a known distinct number.)
So, you're basically left having to map from strings to hash values randomly but repeatably. The "Birthday Paradox" then kicks in, meaning you must expect quite a lot of collisions. How many? The standard birthday-problem formula gives, for m buckets (2^32) and n inserts (1,000,000):
expected collisions = n - m * (1 - ((m-1)/m)^n)
= 1,000,000 - 2^32 * (1 - ((2^32 - 1) / 2^32) ^ 1,000,000)
= 1,000,000 - 2^32 * (1 - 0.99976719645926983712557804052625)
~= 1,000,000 - 999883.6
~= 116.4
Put another way, even the very best possible hash function would - on average, for random string inputs - still produce about 116 collisions.
Your teacher says:
final score for you is max{0, 200 – 5*T}
So, there's no point doing the assignment: with ~116 expected collisions, 200 - 5*116 = -380, clamped to 0. You're more likely to have a 24 carat gold meteor land in your front garden than get a positive score.
That said, if you want to achieve the lowest number of collisions for the class, a lowish-performance (not particularly cache friendly) but minimal collision option is simply to have an array of random data...
uint32_t data[20][256] = { ... };
Download some genuinely random data from an Internet site to populate it with. Discard any duplicate numbers (in C++, you can use a std::set<> to find them). Index by character position (0..19) then character value, generating your hash by XORing the values.
Illustration of collisions
If unconvinced by the information above, you can generate a million random 32-bit values - as if they were hashes of distinct strings - and see how often the hash values repeat. Any given run should produce output not too far from the 116 collision average calculated above.
#include <iostream>
#include <map>
#include <random>

int main()
{
    std::random_device rd;

    // Count how many times each 32-bit value is produced.
    std::map<unsigned, int> count;
    for (int i = 0; i < 1000000; ++i)
        ++count[rd()];

    // Histogram: how many values appeared once, twice, ...
    std::map<int, int> histogram;
    for (auto& c : count)
        ++histogram[c.second];

    for (auto& h : histogram)
        std::cout << h.second << " hash values generated by "
                  << h.first << " key(s)\n";
}
A few runs produced output...
$ ./poc
999752 hash values generated by 1 key(s)
124 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999796 hash values generated by 1 key(s)
102 hash values generated by 2 key(s)
$ ./poc
999776 hash values generated by 1 key(s)
112 hash values generated by 2 key(s)
$ ./poc
999784 hash values generated by 1 key(s)
108 hash values generated by 2 key(s)
$ ./poc
999744 hash values generated by 1 key(s)
128 hash values generated by 2 key(s)
Regarding the question edit asking whether it's possible to edit the hashing function that unordered_map uses to solve this: you don't need an unordered_map, but if you wanted to test your hash function before submission using your own set of random strings, a std::unordered_map<uint32_t, int> num_values; would let you count the number of string values hashing to a given value: for (auto& s : strings) ++num_values[hash(s)];. A loop over num_values can then look for counts >= 2, indicative of collisions. If instead you stored the strings themselves in an unordered_map, you'd also get a number of collisions indicative of the hash function quality, but compounded by the wrapping of the 32-bit hash space into the smaller bucket-index space (with the default max_load_factor of 1.0, you can expect >= 1,000,000 but < ~2,000,000 buckets).