20

I recently came to know that in Java 8 HashMap uses a binary tree instead of a linked list within a bucket, and the hash code is used as the branching factor. I understand that in the case of high collision the lookup is reduced from O(n) to O(log n) by using binary trees. My question is: what good does it really do, given that the amortized time complexity is still O(1)? Perhaps if you force all the entries into the same bucket by providing the same hash code for every key you can see a significant time difference, but no one in their right mind would do that.

A binary tree also uses more space than a singly linked list, as it stores both left and right nodes. Why increase the space complexity when there is absolutely no improvement in time complexity except for some contrived test cases?
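For concreteness, here is a minimal sketch of the degenerate case described above (the BadKey class is hypothetical, not from the question): every key reports the same hash code, so all entries land in one bucket.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type illustrating the degenerate case: every instance
// reports the same hash code, so all entries end up in a single bucket.
final class BadKey implements Comparable<BadKey> {
    final int id;
    BadKey(int id) { this.id = id; }

    @Override public int hashCode() { return 42; }             // constant: worst case
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override public int compareTo(BadKey other) {              // lets Java 8 keep the bucket as a tree
        return Integer.compare(id, other.id);
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), i);
        }
        // Pre-Java-8 this get() is a linear scan over ~100,000 colliding entries;
        // with Java 8's treeified buckets it takes O(log n) comparisons.
        System.out.println(map.get(new BadKey(99_999)));         // prints 99999
    }
}
```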

4 Comments
  • This is not a debate. I'm quoting: "My question is what good does it really do [...] maybe if you [...] we can see a significant time difference but no one in their right minds would do that." So you're saying it was designed to handle a case no one in their right mind would do, hence that the devs explicitly handled a dumb case, hence that the devs are dumb and that you hold the Truth. I suggest rewording your question. Commented Mar 9, 2016 at 10:13
  • Your question makes no sense. If you assume that hash collisions won’t happen, then the binary tree does not consume more space, as the binary tree will only be used when there are hash collisions. To be more precise, the number of collisions must exceed a threshold before the implementation switches from linked list to binary tree. Commented Mar 9, 2016 at 10:15
  • @Tunaki I specifically meant the people deliberately trying that scenario to show the advantage; if I thought there was nothing more to it, I would never have asked it as a question. Commented Mar 9, 2016 at 10:17
  • @Hoger Nice catch, I didn't know that a linked list is also used! Commented Mar 9, 2016 at 10:21

2 Answers

29

This is mostly a security-related change. While in a normal situation it's rarely possible to have many collisions, if hash keys arrive from an untrusted source (e.g. HTTP header names received from the client), then it's possible and not very hard to specially craft the input so that the resulting keys have the same hash code. Now if you perform many look-ups, you may experience a denial of service. It appears that there's quite a lot of code in the wild which is vulnerable to this kind of attack, thus it was decided to fix this on the Java side.

For more information refer to JEP-180.
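To make the "not very hard" part concrete, here is a small illustrative sketch (the class name is made up) that mass-produces String keys with identical hash codes, relying only on the documented String.hashCode() algorithm:

```java
import java.util.ArrayList;
import java.util.List;

// Because String.hashCode() is specified by the language, colliding keys are
// easy to generate: "Aa" and "BB" both hash to 2112, so concatenating n such
// pairs yields 2^n distinct strings that all share a single hash code.
public class CollidingKeys {
    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < 16; i++) {                  // 2^16 = 65,536 colliding keys
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        // Every generated key has the same hash code; an attacker sending these
        // as HTTP header names forces them all into one HashMap bucket.
        System.out.println(keys.get(0).hashCode() == keys.get(keys.size() - 1).hashCode()); // true
    }
}
```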


6 Comments

Agreed, but how can the user craft inputs to increase collisions without knowing the underlying hash function?
@user1613360, the hash function for String is well-known and documented, so it's not a big problem. Every Java implementation must use the same function.
Besides the fact that certain hash codes are fixed and well documented, an attacker can always try to find out which particular version a server is running and look how the hash code of a particular class implementation is calculated. That’s how most attacks work, first, find out which software runs, then, try an appropriate exploit tailored to the specific software and version.
@AnmolSinghJaggi it won't work if the hashCode is exactly the same. And in case of targeted DoS attack it's quite easy to generate e.g. string keys with exactly the same hashCode.
26

Your question contains some wrong premises.

  1. A bucket collision is not necessarily a hash collision. You don’t need to have the same hash code for two objects to end up in the same bucket. The bucket is an element of an array, and the hash code has to be mapped to a particular index. Since the array size should be reasonable with respect to the Map’s size, you can’t raise the array size arbitrarily to avoid bucket collisions. There’s even the theoretical limitation that the array size can be at most 2³¹ while there are 2³² possible hash codes.
  2. Having a hash collision is not a sign of bad programming. For all objects having a value space larger than 2³², the possibility of distinct objects with the same hash code is unavoidable. Strings are an obvious example, but even a Point bearing two int values or a plain Long key has unavoidable hash collisions. So they might be more common than you think, and it heavily depends on the use case.
  3. The implementation switches to a binary tree only when the number of collisions in a bucket exceeds a certain threshold, so the higher memory costs only apply when they will pay off. There also seems to be a common misunderstanding regarding how these trees work. Since bucket collisions are not necessarily hash collisions, the binary search will first search the hash codes. Only when the hash codes are identical and the key appropriately implements Comparable is its natural order used (see the sketch after this list). The examples you might have found on the web deliberately use the same hash code for objects to demonstrate the use of the Comparable implementation, which otherwise would not show up. What they trigger is only the last resort of the implementation.
  4. As Tagir pointed out, this issue might affect the security of software, as a slow fallback may open the possibility of DoS attacks. There have been several attempts in previous JRE versions to address this issue which had more drawbacks than the memory consumption of a binary tree. For example, there was an attempt to randomize the mapping of hash codes to array entries in Java 7, which caused an initialization overhead as documented in this bug report. That is not the case with this new implementation.
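A simplified sketch of the ordering described in point 3 (the helper below is illustrative, not the JDK's actual code): a treeified bucket orders entries primarily by hash code and falls back to the keys' natural order only on a genuine hash collision between mutually comparable keys.

```java
// Illustrative sketch, not the JDK source: how a treeified bucket can order
// its entries. Different hash codes are compared directly; identical hash
// codes fall back to Comparable when the keys support it.
final class TreeBinOrderSketch {
    @SuppressWarnings({"rawtypes", "unchecked"})
    static int compareForTree(Object k1, int h1, Object k2, int h2) {
        if (h1 != h2) {
            return Integer.compare(h1, h2);           // common case: bucket collision only
        }
        if (k1 instanceof Comparable && k1.getClass() == k2.getClass()) {
            return ((Comparable) k1).compareTo(k2);   // true hash collision: use natural order
        }
        return 0;  // the real implementation applies further tie-breakers here
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a genuine hash collision, so only they reach compareTo().
        System.out.println(compareForTree("Aa", "Aa".hashCode(), "BB", "BB".hashCode())); // negative: natural order
        System.out.println(compareForTree("Ab", "Ab".hashCode(), "BB", "BB".hashCode())); // decided by hash codes
    }
}
```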

4 Comments

Pointing out that a "bucket collision" is not necessarily the same as a "hash collision" is important here. Many programmers mistakenly think they are synonymous.
I have one doubt about this. For HashMap, it's not mandatory for the key to implement Comparable. So are we assuming that Comparable is implemented by the key when we switch to the binary-tree-based HashMap?
@Akanksha perhaps you didn’t read this answer carefully enough, especially the 3rd point. A high number of bucket collisions is solved by turning the affected bucket into a binary tree over the hash codes. This does not require the keys to be comparable; the hash codes are. Only for a significantly high number of actual hash collisions, i.e. identical hash codes, is a binary tree using the Comparable implementation used, and only if the keys are comparable; otherwise not.
