Why java hashcode implementation 31 * x + y is better than x + y?

Question

I'm confused with java interview question about which hashcode implementation is better. We have a class Point {int x, y; }. Why does implementation of hashcode 31 * x + y for this class is better than x + y ? The right answer is "The multiplier creates a dependence of the hashcode value on the order of processing the fields, which ultimately gives rise to a better hash function". But I can't understand why the order of processing is the point here because the entire expression 31 * x + y is calculating when I executes point1.equals(point2); And there is no matter in which order it happens. Am I wrong?

If you happen to use Point as a key in map/hashtable, having hashCode() {return x + y;} would create redundant hash collision for point (1,2) and point (2,1), etc. — Nowhere Man
– Nowhere Man, Commented Apr 20, 2020 at 7:59
I think the amount of collisions will remain the same. With hashCode() {return 31 * x + y;}, pairs (0,31) and (1,0) will have a collision as well. — Amongalen
– Amongalen, Commented Apr 20, 2020 at 8:03
@Amongalen It's always possible to have collisions, but in real-world usage the values 0 and 1 will be found much more frequently than 31. — chrylis -cautiouslyoptimistic-
– chrylis -cautiouslyoptimistic-, Commented Apr 20, 2020 at 8:06
@Amongalen The is no perfect hash, and all have collisions. But the function used may lead to less (or more) collision in usual cases. x+y would introduce collisions by symmetry (a very usual property in geometry). — Jean-Baptiste Yunès
– Jean-Baptiste Yunès, Commented Apr 20, 2020 at 8:07
@Amongalen your link doesn't explain why it is better, just that it is used in Java. 31 has been chosen empirically so that the tradeoff between collisions and effective computation. The general method has been studied by Knuth, using polynomials and powers of primes. — Jean-Baptiste Yunès
– Jean-Baptiste Yunès, Commented Apr 20, 2020 at 8:23

Jean-Baptiste Yunès · Accepted Answer · 2024-03-20 16:36:26Z

11

If you use x+y then how to distinguish points (3,4) and (4,3)? Both will have the same hashcode...

Now while 31 * x + y will not be perfect, in the same case, it will be much much better.

Note: by definition of hashing there is no perfect hashing. The only thing is to analyse what kind of collisions occurs for a given hash function. In the geometrical case the first one introduces collisions for a very simple and usual symmetry property. Thus in very common cases there could be too much collisions.

----EDIT----

Note that 31x+y ensures that no points in the rectangle (0,0)—(huge value,30) will collide, etc. A point (x,y) will collide only with points (x',y') such that (31x+y=31x'+y') which is a very uncommon geometrical property.

edited Mar 20, 2024 at 16:36

answered Apr 20, 2020 at 7:58

Jean-Baptiste Yunès

36.6k4 gold badges51 silver badges76 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Amongalen Over a year ago

In the first case, (3,4) and (4,3) have the same hashcode. In the second case, (0,31) and (1,0) have the same hashcode. Why second is better than the first?

Jean-Baptiste Yunès Over a year ago

@Amongalen because the geometrical property induced by the equivalence function is much more unusual in the second case.

Titulum Over a year ago

@Amongalen, this can be fixed by also multiplying the second item, e.g.: 31 * x + 37 * y

Amongalen Over a year ago

@Titulum and then some other pair of points will collide with each other, eg. (0,31) and (37, 0) :)

Jean-Baptiste Yunès Over a year ago

@Titulum No way. Coding two integers into one ===> collision, whatever you do.

|

chrylis -cautiouslyoptimistic- · Accepted Answer · 2020-04-20 07:59:39Z

6

Imagine you have two string properties prop1 and prop2, and two objects:

A: {prop1="foo", prop2="bar"}
B: {prop1="bar", prop2="foo"}

These are clearly different values, and it's useful to set up the hash code to distinguish between them. If you simply add the properties' hash codes together, you'll get the same value for both A and B. Instead, by multiplying and adding, the hash code will be different based on the property sequence.

It seems you may be slightly misinterpreting the advice: The purpose of the multiply-and-add is to create a dependence on the semantic order of properties within an object, not the execution order of the calculation.

answered Apr 20, 2020 at 7:59

chrylis -cautiouslyoptimistic-

77.5k21 gold badges127 silver badges167 bronze badges

1 Comment

Marichyasana Over a year ago

>>> "foo" * 37 + "bar" 'foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoobar' >>>

Basil Bourque · Accepted Answer · 2024-03-20 18:18:10Z

Jean-Baptiste Yunès’ answer is the correct one, but I will add the following example to illustrate (keep in mind that it's in javascript, only because I quickly implemented this for the example):

class Point {
    constructor(x, y) {
        this.x = x;
        this.y = y;
    }
}

function getHashCollisions(collection, hashFunction) {
    const collisionMap = new Map();
    let count = 1;
    let total = collection.length;
    for (const point of collection) {
        console.log(`calculating ${count++}/${total}`);
        const currentHash = hashFunction(point);
        const hashCount = collisionMap.has(currentHash) ? collisionMap.get(currentHash) +1 : 1;
        collisionMap.set(currentHash, hashCount);
    }
    return collisionMap;
}

function generateDataset(rangeX, rangeY) {
    const points = [];
    let count = 1;
    for (let x = 0; x < rangeX; x++) {
        for (let y = 0; y < rangeY; y++) {
            console.log(`generating ${count++} Point(${x};${y})`);
            points.push(new Point(x, y));
        }
    }
    return points;
}

function calculateAndGenerateReport(dataset, hashFunction, hashFunctionName) {
    const hashes = getHashCollisions(dataset, hashFunction);
    const totalCollisions = Array.from(hashes.values()).filter(currentCollisionCount => currentCollisionCount > 1).length;
    const highestCollisionCount = Array.from(hashes.values()).reduce((currentHighest, current) => current > currentHighest ? current : currentHighest) - 1;
    return `${hashFunctionName}: ${totalCollisions} collisions, highest collision count: ${highestCollisionCount}`;
}

const dataset = generateDataset(100, 100);

const literalHashesReport = calculateAndGenerateReport(dataset, point => point.x + point.y, "literal hash function:");
const onePrimeHashesReport = calculateAndGenerateReport(dataset, point => 31 * point.x + point.y, "one prime multiplication hash function:");
const twoPrimesHashesReport = calculateAndGenerateReport(dataset, point => 31 * point.x + 37 * point.y, "two primes multiplication hash function:");
const twoLargePrimesHashesReport = calculateAndGenerateReport(dataset, point => 8191 * point.x + 131071 * point.y, "two large primes multiplication hash function:");

console.log(literalHashesReport);
console.log(onePrimeHashesReport);
console.log(twoPrimesHashesReport);
console.log(twoLargePrimesHashesReport)

Result:

literal hash function: 197 collisions, highest collision count: 99
one prime multiplication hash function: 3107 collisions, highest collision count: 3
two primes multiplication hash function: 3359 collisions, highest collision count: 2
two large primes multiplication hash function: 0 collisions, highest collision count: 0

This shows that the (prime) numbers we choose to "calculate" the hash greatly decreases the probability for collisions.

Collectives™ on Stack Overflow

Why java hashcode implementation 31 * x + y is better than x + y?

3 Answers 3

6 Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related