
I have a struct which consists of five std::uint8_t members. My software doesn't support 32-bit builds, only 64-bit builds. I want to use my struct as a key in an unordered map. Can I just add three extra bytes to the struct to fill a full 64 bits and cast the struct to size_t in order to get a hash safely? Like this:

#include <cstddef>     // std::size_t
#include <cstdint>     // std::uint8_t
#include <functional>  // std::hash

struct MyStruct
{
    std::uint8_t v1 = 0;
    std::uint8_t v2 = 0;
    std::uint8_t v3 = 0;
    std::uint8_t v4 = 0;
    std::uint8_t v5 = 0;
    std::uint8_t pack1 = 255;
    std::uint8_t pack2 = 255;
    std::uint8_t pack3 = 255;
};

namespace std
{
template <>
struct hash<MyStruct>
{
    size_t operator()(const MyStruct &s) const
    {
        static_assert(sizeof(size_t) == 8);
        static_assert(sizeof(MyStruct) == 8);
        const size_t *memptr = reinterpret_cast<const size_t*>(&s);
        return *memptr;
    }
};
}
  • You have five bytes – how did you handle them on 32 bit? You actually had the same problem there, too, didn't you? Commented May 24, 2023 at 8:06
  • I'm pretty sure this reinterpret_cast is not legal because it violates strict aliasing rules, by the way... Commented May 24, 2023 at 8:08
  • Unrelated: Consider using std::bit_cast, if available, since this also ensures the sizes of the source and target types are the same. If it isn't available, you should definitely add a static_assert to check for equal sizes; otherwise you may at some point use a compiler with a non-64-bit size_t without thinking about this specific part of the code, possibly making the hash algorithm much worse depending on the value distribution without you realizing it... Commented May 24, 2023 at 8:10
  • This is new code and was never run under 32-bit. Commented May 24, 2023 at 8:11
  • Just do some relative-prime arithmetic on the fields most likely to change, e.g. s.v1*17 + s.v2*13 .... Just find a combination that is likely to result in a unique hash. If you think that is not efficient enough, do some profiling and MEASURE. In the end the "uniqueness" of your hash is more important than it being the fastest to calculate (a hash collision is way more costly); see the sketch after this list. Commented May 24, 2023 at 9:02
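
A rough sketch of that multiply-and-add idea (the first two multipliers are the ones suggested in the comment; the remaining small primes are my own, purely illustrative, choice):

#include <cstddef>

// Combine the fields with pairwise-coprime multipliers.
std::size_t simple_hash(const MyStruct& s)
{
    return std::size_t(s.v1) * 17
         + std::size_t(s.v2) * 13
         + std::size_t(s.v3) * 11
         + std::size_t(s.v4) * 7
         + std::size_t(s.v5) * 5;
}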

3 Answers

1

Can I just add three extra bytes to the struct to fill full 64 bit and cast the struct to size_t in order to get a hash safely?

No - as others have mentioned, you'd have undefined behaviour both because MyStruct may have weaker alignment than size_t and because of aliasing (you may only inspect an object's bytes through char*, unsigned char* or std::byte*, not reinterpret it as a size_t). As of C++20, std::bit_cast is the recommended way to do this: std::bit_cast<size_t>(some_MyStruct_object).
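
For illustration, a minimal sketch of that approach, assuming the 8-byte padded struct from the question (the static_assert guards the size assumption):

#include <bit>         // std::bit_cast (C++20)
#include <cstddef>
#include <functional>

template <>
struct std::hash<MyStruct>
{
    std::size_t operator()(const MyStruct& s) const noexcept
    {
        static_assert(sizeof(MyStruct) == sizeof(std::size_t));
        // Copies the object representation; no alignment or aliasing issues.
        return std::bit_cast<std::size_t>(s);
    }
};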

While the above has already been said by Red.Wave and nielsen, Red.Wave also mentioned:

This will modify the result in case std::hash on built-in integrals is anything other than identity.

In practice, std::hash<size_t> - to the best of my knowledge - is an identity hash function in clang, GCC, and MSVC++; certainly in current and all vaguely recent versions of clang and GCC (I've just rechecked on godbolt). Thankfully those two use prime numbers for the bucket count, so there it doesn't matter. But MSVC++ has historically (and I imagine still, though godbolt won't execute code under MSVC++) used powers of two for the bucket count, so there it does matter.
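
If you want to verify this on your own standard library, a throwaway check (a hypothetical snippet, not part of the answer) is enough:

#include <cstddef>
#include <functional>
#include <iostream>

int main()
{
    // Prints 1 on libstdc++/libc++, where std::hash<std::size_t> is the identity.
    std::cout << (std::hash<std::size_t>{}(42) == 42) << '\n';
}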

On MSVC++ and any other implementation with a power-of-two bucket count, the simple bit_cast approach will create terrible hash table collisions. When the hash function returns a number, it is folded into the bucket range by masking with bucket_count - 1, which effectively uses only however many of the least-significant bits are necessary to identify a bucket (64 buckets -> 6 bits, 128 buckets -> 7 bits, etc.).

To try to make this clearer, say your MyStruct object has values {ab, cd, ef, gh, ij, pad1, pad2, pad3} - where the two-letter combinations represent 2-digit hex values of your uint8_ts - and your hash table's bucket_count is currently 256. You hash your object and end up with - if your system is little-endian - FFFF'FFij'ghef'cdab. Then only the low-order 8 bits are kept to produce a 0..255 bucket index, so only that one byte - ab - from your MyStruct object affects which bucket you hash/mask to. If your data was {1, 2, 3, 4, 5}, {1, 202, 18, 48, 2}, {1, 7, 27, 87, 85}, {1, 48, 26, 58, 16} -> all those entries would collide at bucket 1, and your hash table then performs like a linked list. If - with your endianness - padding bytes end up in the less significant bit positions of the size_t, they won't contribute in the slightest to dispersing your bucket usage.
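
To make the masking concrete, here is a hypothetical sketch of the arithmetic a power-of-two table performs (not any particular library's actual code), again assuming the 8-byte struct from the question:

#include <bit>
#include <cstddef>

// With the identity hash, only the low-order bits of the key survive.
std::size_t bucket_of(const MyStruct& s, std::size_t bucket_count /* power of two */)
{
    std::size_t h = std::bit_cast<std::size_t>(s);  // identity "hash"
    return h & (bucket_count - 1);                  // e.g. 256 buckets -> low 8 bits
}

// bucket_of({1, 2, 3, 4, 5}, 256) == bucket_of({1, 202, 18, 48, 2}, 256) == 1
// on a little-endian machine: only v1 determines the bucket.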

While it's reasonable to first generate a size_t value from MyStruct with a bit_cast, you may want to then perform some actual, meaningful hashing on it. As mentioned, you typically can't simply invoke std::hash<size_t>() on it, as that's often an identity hash. So, find an SO question or reference with a decent hash for size_t, or use something like the Intel CRC instruction _mm_crc32_u64.
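
As one concrete option (an assumption on my part, not something prescribed above), the splitmix64 finalizer is a widely used 64-bit mixer:

#include <bit>
#include <cstddef>
#include <cstdint>

// splitmix64 finalizer: mixes all 64 input bits into the result.
std::uint64_t mix64(std::uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

std::size_t hash_value(const MyStruct& s)
{
    // bit_cast first, then mix, so every byte influences every result bit.
    return static_cast<std::size_t>(mix64(std::bit_cast<std::uint64_t>(s)));
}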

(Because these things are tricky and implementation choices sometimes surprising, when you have reason to care about performance, it's generally a good idea to measure collision chain lengths with your data and hash function, to ensure you don't have unexpected collision rates.)
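
For example, a small diagnostic along these lines (a hypothetical helper, not part of the answer) can be run against your real data:

#include <algorithm>      // std::max
#include <cstddef>
#include <unordered_map>

// Longest collision chain in the table - a rough measure of hash quality
// for the keys actually stored.
template <class Map>
std::size_t longest_chain(const Map& m)
{
    std::size_t worst = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b)
        worst = std::max(worst, m.bucket_size(b));
    return worst;
}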


1 Comment

Thanks for this very informative response which made things a lot clearer for me.
1

The proposed solution is prone to undefined behavior, at least because memptr may not fulfil the alignment requirement of a size_t.

A better alternative is to use memcpy:

    #include <algorithm>  // std::min
    #include <cstring>    // std::memcpy

    size_t operator()(const MyStruct &s) const
    {
        size_t result = 0;
        std::memcpy(&result, &s, std::min(sizeof(s), sizeof(size_t)));
        return result;
    }

This should work regardless of size differences between the struct and size_t, but of course it will only distinguish two structs by their first sizeof(size_t) bytes.

It should not be a problem with your struct, but generally, you have to be aware that a struct may contain padding bytes with uncontrolled values that may mess up the hash result.
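
For completeness, note that an unordered container also needs an equality comparison for the key type, which the question does not show; a sketch (this operator== is my assumption):

#include <unordered_map>

bool operator==(const MyStruct& a, const MyStruct& b)
{
    return a.v1 == b.v1 && a.v2 == b.v2 && a.v3 == b.v3
        && a.v4 == b.v4 && a.v5 == b.v5;
}

// With operator== and a std::hash<MyStruct> specialization in place:
// std::unordered_map<MyStruct, int> table;
// table[MyStruct{1, 2, 3, 4, 5}] = 42;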

5 Comments

But isn't padding included in sizeof? So if sizeof(MyStruct) == sizeof(size_t) and I check this with the static_assert above, I don't see how it can fail (at least with the compilers my continuous integration platform checks).
@fschmitt Yes, with the check of the struct size you make sure that there is no padding and as said it is not likely to be an issue with your struct. The remark was meant for the more general case of interpreting the byte representation of an arbitrary struct.
Is there a guarantee that the padding bits are always set consistently? Otherwise, if your structure has padding, this relies on the padding bits to be consistent across the entire program.
@DaveS No there isn't.
@DaveS That is the potential issue I wanted to flag with this remark. To my understanding, you can only rely on the padding bytes to be zero if the struct was originally initialized.
1

You need the type as a key to standard unordered containers, and it is smaller in size than std::size_t. Therefore the bit pattern can be used as a perfect hash function. You do not need strange techniques for it:

std::size_t hash_res = 0;
// Shift each value byte into the result, one byte at a time.
for (auto byte : std::to_array<std::uint8_t>({s.v1, s.v2, s.v3, s.v4, s.v5}))
    (hash_res <<= 8) += byte;

If the compiler can't optimize it, the more complicated version would be:

auto const hash_res =
    std::bit_cast<std::size_t>(
        std::array<std::uint8_t, sizeof(std::size_t)>
            {s.v1, s.v2, s.v3, s.v4, s.v5});

To wrap things up:

return std::hash<std::size_t>{}(hash_res);

This will modify the result in case std::hash on built-in integrals is anything other than the identity. If you prefer a perfect hash, you can skip the rehash through std::hash<std::size_t> and return hash_res directly; that avoids any truncation that could introduce collisions.
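
Put together, a complete specialization along these lines might look like this (a sketch of this answer's bit_cast variant, offered as an illustration rather than the definitive form):

#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <functional>

template <>
struct std::hash<MyStruct>
{
    std::size_t operator()(const MyStruct& s) const noexcept
    {
        static_assert(sizeof(std::size_t) == 8);
        // Pack the five value bytes into a size_t-sized array; the remaining
        // three bytes are zero, so equal keys always hash equally.
        return std::bit_cast<std::size_t>(
            std::array<std::uint8_t, sizeof(std::size_t)>
                {s.v1, s.v2, s.v3, s.v4, s.v5});
    }
};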

2 Comments

Exactly. If your potential key size is 64 bits and your actual key is only 40 bits, then there's no need for hashing at all. Just use the actual key.
@JimMischel that's what the OP's snippet seems to be trying to do. I am just trying to achieve the same with less effort and more readability.
