Is there a better implementation for keeping a count for unique integer pairs?

Question

This is in C++. I need to keep a count for every pair of numbers. Both numbers are of type "int". I sort the two numbers, so the pair (n1, n2) is the same as the pair (n2, n1). I'm using std::unordered_map as the container.

I have been using the elegant pairing function by Matthew Szudzik, Wolfram Research, Inc. In my implementation, the function gives me a unique number of type "long" (64 bits on my machine) for every pair of two numbers of type "int". I use this long as the key for the std::unordered_map. Is there a better way to keep count of such pairs? By better I mean faster and, if possible, with less memory usage.
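For reference, a minimal sketch of what such a key function might look like, assuming unsigned 32-bit inputs and Szudzik's published formula (x*x + x + y when x >= y, otherwise y*y + x). The names elegant_pair_unordered and count_pair are made up for illustration and are not the asker's actual code.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>

// Szudzik's "elegant pairing": x*x + x + y when x >= y, else y*y + x.
// Taking max/min first means only the x >= y branch is ever needed, and
// (a, b) and (b, a) produce the same key.
inline std::uint64_t elegant_pair_unordered(std::uint32_t a, std::uint32_t b) {
    const std::uint64_t hi = std::max(a, b);
    const std::uint64_t lo = std::min(a, b);
    // Largest possible result is (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so it fits.
    return hi * hi + hi + lo;
}

// Hypothetical helper: bump the count for an unordered pair.
inline void count_pair(std::unordered_map<std::uint64_t, int>& counts,
                       std::uint32_t a, std::uint32_t b) {
    ++counts[elegant_pair_unordered(a, b)];
}
```

Calling count_pair(counts, 3, 5) and then count_pair(counts, 5, 3) increments the same entry twice.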

Also, I don't need all the bits of long. Even though you can assume that the two numbers can range up to the max value for 32 bits, I anticipate the max possible value of my pairing function to require at most 36 bits. If nothing else, is there at least a way to have just 36 bits as the key for the unordered_map? (some other data type)

I thought of using bitset, but I'm not exactly sure if the std::hash will generate a unique key for any given bitset of 36 bits, which can be used as key for unordered_map.

I would greatly appreciate any thoughts, suggestions etc.

How about a std::set of length 2 for each pair? That way the order is not important.
– Cory Kramer, Commented Oct 6, 2014 at 18:21
Input can be anything. Positive integers. I've been using int, but unsigned int will also work.
– learningToCode, Commented Oct 6, 2014 at 18:26
long - don't rely on the machine, use more concrete types, e.g.: uint64_t
– Karoly Horvath, Commented Oct 6, 2014 at 18:37

Slava · Accepted Answer · 2014-10-07 13:44:13Z


First of all, I think you started from a wrong assumption. For std::unordered_map and std::unordered_set the hash does not have to be unique (and in principle it cannot be for data types such as std::string); it only needs a low probability that two different keys produce the same hash value. A collision is not the end of the world; access just gets slower. I would generate a 32-bit hash from the two numbers and, if you have an idea of the typical values, measure the probability of a hash collision and choose the hash function accordingly.
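To illustrate the point about collisions, here is a small sketch with a deliberately useless hash that maps every key to the same value. ConstantHash, CollidingCounts and demo_collisions are illustrative names only; the point is that the container still keeps distinct keys apart by comparing them with operator==.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Deliberately terrible hash (illustration only): every key collides.
struct ConstantHash {
    std::size_t operator()(std::uint64_t) const noexcept { return 0; }
};

using CollidingCounts = std::unordered_map<std::uint64_t, int, ConstantHash>;

// Increment two distinct keys that share one hash value. The map compares
// the stored keys for equality, so their counts stay separate.
inline CollidingCounts demo_collisions() {
    CollidingCounts counts;
    ++counts[10];
    ++counts[10];
    ++counts[20];
    return counts;
}
```

The result holds two entries, with counts 2 and 1. Lookups degrade to a linear scan of one bucket, which is the "slower access" cost mentioned above, but the counts remain correct.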

For that to work, you should use a pair of 32-bit numbers as the key in std::unordered_map and provide a proper hash function. Calculating a unique 64-bit key and using it with a hash map is controversial, because the hash map will then calculate another hash of this key, so you may actually be making it slower.
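A minimal sketch of that setup, assuming unsigned 32-bit values. Pair, PairHash, PairCounts and add_pair are hypothetical names, and the hash shown simply forwards a packed value to std::hash; whether it spreads your data well is something to measure, as suggested above.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

using Pair = std::pair<std::uint32_t, std::uint32_t>;

// Hypothetical hash functor: packs both halves into 64 bits and defers
// to std::hash. Any reasonable mixing function could be swapped in here.
struct PairHash {
    std::size_t operator()(const Pair& p) const noexcept {
        const std::uint64_t packed = (std::uint64_t{p.first} << 32) | p.second;
        return std::hash<std::uint64_t>{}(packed);
    }
};

using PairCounts = std::unordered_map<Pair, int, PairHash>;

// Store the smaller value first so (a, b) and (b, a) share one entry.
inline void add_pair(PairCounts& counts, std::uint32_t a, std::uint32_t b) {
    if (a > b) std::swap(a, b);
    ++counts[Pair{a, b}];
}
```

The key stored in the map is the pair itself, so two pairs that happen to hash alike are still counted separately.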

As for the 36-bit key, this is not a good idea unless you have a special CPU that handles 36-bit data. Either your data will be aligned on a 64-bit boundary, in which case you save no memory, or you will pay the penalty of unaligned data access. In the first case you would just add extra code to extract 36 bits from 64-bit data (if the processor supports it). In the second, your code will be slower than with a 32-bit hash, even if there are some collisions.
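A rough way to check this on your own compiler is sketched below. Key36, kMask36 and low36 are illustrative names. Bit-field layout is implementation-defined, so the sizes are worth printing rather than assuming, although common 64-bit ABIs typically report 8 bytes for both.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 36-bit key expressed as a bit-field.
struct Key36 {
    std::uint64_t value : 36;
};

// Layout is implementation-defined; compare these two on your platform.
constexpr std::size_t key36_bytes = sizeof(Key36);
constexpr std::size_t key64_bytes = sizeof(std::uint64_t);

// The "extra code" needed to pull 36 bits out of a 64-bit word.
constexpr std::uint64_t kMask36 = (std::uint64_t{1} << 36) - 1;
constexpr std::uint64_t low36(std::uint64_t v) { return v & kMask36; }
```

If key36_bytes equals key64_bytes, the narrower key buys nothing in memory and only adds the masking step.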

If that hash map is a bottleneck, you may consider a different hash map implementation such as goog-sparsehash.sourceforge.net

edited Oct 7, 2014 at 13:44

answered Oct 6, 2014 at 18:42

Slava


6 Comments

learningToCode Over a year ago

Thank you. That makes sense. I wanted it to be unique so that I could just use the unordered_map. If it's not unique, then I should implement my own table, correct? Or am I going wrong somewhere?

Slava Over a year ago

@learningToCode updated answer, no you do not need to reimplement unordered_map

learningToCode Over a year ago

Thanks a lot. That's really interesting and not obvious to me. Suppose my hash generates the same key for two different inputs (however low the probability), and let's call that key 'K', of type uint32_t. Say I have std::unordered_map<uint32_t, int> table. I have been using table[K]++ to increment the count. So I don't see how two distinct pairs mapped to K can be told apart. I will look it up, but if it's something simple, please let me know or redirect me to it, and thanks a lot.

Slava Over a year ago

@learningToCode you misunderstand the concept of a hash map. The key in the map should be the pair of numbers, not the hash. The hash function is specified separately, and it does not really matter whether it produces 64 or 32 bits, as the hash is not stored in the map. If you really want to save space, you need to find a way to pack that pair into 32 bits uniquely; 36 bits would neither save room nor increase speed, unless you find a CPU that works natively with 36 bits, which I doubt.

Slava Over a year ago

@learningToCode sorry did not fully understand your question originally and missed the part that you use 64bit as a key. Updated answer.


IdeaHat · Accepted Answer · 2014-10-06 21:18:44Z


Just my two cents: the pairing functions in the article are WAY more complicated than you actually need. Mapping two 32-bit UNSIGNED values to 64 bits uniquely is easy. The following does that, and even handles the non-pair states, without hitting the math peripheral too heavily (if at all).

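#include <cstdint>

uint64_t map(uint32_t a, uint32_t b)
{
    // Widen before adding so the sum is not computed in 32-bit arithmetic.
    uint64_t x = (uint64_t)a + b;
    // Absolute difference without a signed cast, which breaks for gaps >= 2^31.
    uint64_t y = (a > b) ? (a - b) : (b - a);

    // Sum in the high half, difference in the low half; (a,b) and (b,a) match.
    uint64_t ans = (x << 32) | y;
    return ans;
}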

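void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
    uint64_t x = map >> 32;          // a + b
    uint64_t y = map & 0xFFFFFFFFu;  // |a - b|

    *a = (uint32_t)((x + y) >> 1);   // larger of the two values
    *b = (uint32_t)(x - *a);         // smaller of the two values
}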

Another alternative:

uint64_t map(uint32_t a, uint32_t b)
{
    bool bb = a > b;
    // The larger value is shifted into the high 32 bits; the smaller stays low.
    uint64_t x = ((uint64_t)a) << (32 * bb);
    uint64_t y = ((uint64_t)b) << (32 * !bb);

    uint64_t ans = x | y;
    return ans;
}

void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
    *a = (uint32_t)(map >> 32);          // larger of the two values
    *b = (uint32_t)(map & 0xFFFFFFFFu);  // smaller of the two values
}

That works as a unique key. You can easily modify that to be a hash function provider for unordered map, though whether or not that will be faster than std::map is dependent on the number of values you've got.
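As a rough sketch of how such a packed key could be used for counting, assuming unsigned 32-bit inputs: pack_pair, unpack_pair, PackedCounts and add_packed are illustrative names, and this variant uses a plain swap rather than the branch-free shifts shown above.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

// Order-independent packing: larger value in the high 32 bits.
inline std::uint64_t pack_pair(std::uint32_t a, std::uint32_t b) {
    if (a < b) std::swap(a, b);
    return (std::uint64_t{a} << 32) | b;
}

// Recover (larger, smaller) from a packed key.
inline std::pair<std::uint32_t, std::uint32_t> unpack_pair(std::uint64_t key) {
    return {static_cast<std::uint32_t>(key >> 32),
            static_cast<std::uint32_t>(key & 0xFFFFFFFFu)};
}

using PackedCounts = std::unordered_map<std::uint64_t, int>;

// Increment the count for the unordered pair {a, b}.
inline void add_packed(PackedCounts& counts, std::uint32_t a, std::uint32_t b) {
    ++counts[pack_pair(a, b)];
}
```

Swapping PackedCounts for std::map<std::uint64_t, int> requires no other change, which makes it straightforward to measure both containers on real data.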

NOTE: the first (sum and difference) version will fail if a+b does not fit in 32 bits.

edited Oct 6, 2014 at 21:18

answered Oct 6, 2014 at 21:03

IdeaHat


3 Comments

learningToCode Over a year ago

Thanks. I should have thought of that. Just curious why you need to add and subtract the two numbers and not just shift one to first 32 bits and the next number as the other 32 bits of the 64 bit number?

IdeaHat Over a year ago

@learningToCode I wanted to both avoid branching and capture the fact that (a,b)==(b,a). Also I have a tendency to over-think things. Provided an alternate that should do just what you suggested without branching, and is probably just as fast, though you'd have to measure it.

learningToCode Over a year ago

Thanks for your time. This is my first day on stackoverflow as a member. I'm learning quite a lot. Thanks!
