Implementing efficient data structure using Arrays only

Question

As part of my programming course I was given an exercise to implement my own String collection. I was planning on using ArrayList collection or similar but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this using arrays however efficiency is very important as well as the amount of data that this code will be tested with. I was suggested to use hash tables or ordered tress as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement but once I started writing code I realised it is not as straight forward as I thought.

So here are the problems I have come up with and would like some advice on what is the best approach to solve them again with efficiency in mind:

ACTUAL SIZE: If I understood it correctly hash tables are not ordered (indexed) so that means that there are going to be gaps in between items because hash function gives different indices. So how do I know when array is full and I need to resize it?
RESIZE: One of the difficulties that I need to create a dynamic data structure using arrays. So if I have an array String[100] once it gets full I will need to resize it by some factor I decided to increase it by 100 each time so once I would do that I would need to change positions of all existing values since their hash keys will be different as the key is calculated:

int position = "orange".hashCode() % currentArraySize;

So if I try to find a certain value its hash key will be different from what it was when array was smaller.

HASH FUNCTION: I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
DEALING WITH MULTIPLE OCCURRENCES: one of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times the word is stored in my collection. Since they are going to have the same hash code I was planning to add the next occurrence at the next index hoping that there will be a gap. I don't know if it is the best solution but here how I implemented it:

public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}

Thank you in advance for you advice. Please ask anything needs to be clarified.

P.S. The length of words is not fixed and varies greatly.

UPDATE Thank you for your advice, I know I did do few stupid mistakes there I will try better. So I took all your suggestions and quickly came up with the following structure, it is not elegant but I hope it is what you roughly what you meant. I did have to make few judgements such as bucket size, for now I halve the size of elements, but is there a way to calculate or some general value? Another uncertainty was as to by what factor to increase my array, should I multiply by some n number or adding fixed number is also applicable? Also I was wondering about general efficiency because I am actually creating instances of classes, but String is a class to so I am guessing the difference in performance should not be too big?

Did you look at the wikipedia page about hash tables? It describes multiple strategies for resizing and addressing. — that other guy
– that other guy, Commented Dec 9, 2015 at 22:38

Louis Wasserman · Accepted Answer · 2015-12-09 22:31:29Z

1

ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.

RESIZE: Yes, you'll have to rehash everything when the table is resized, which does include recomputing its hash.

So if I try to find a certain value it's hash key will be different from what it was when array was smaller.

Yup.

HASH FUNCTION: Yes, you should use the built in hashCode() function. It's good enough for basic purposes.

DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.

answered Dec 9, 2015 at 22:31

Louis Wasserman

200k26 gold badges367 silver badges437 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

AlenEviLL Over a year ago

Thank you for your answer! But what is the way to determine bucket size?

Louis Wasserman Over a year ago

You're not determining the bucket size, you're just determining the number of buckets, which is just the size of the array you're using inside your hash table.

Peter Lawrey · Accepted Answer · 2015-12-09 22:35:05Z

1

So how do I know when array is full and I need to resize it?

You keep track of the size and HashMap does. When the size used > capacity * load factor you grow the underlying array, either as a whole or in part.

int position = "orange".hashCode() % currentArraySize;

Some things to consider.

The % of a negative value is a negative value.
Math.abs can return a negative value.
Using & with a bit mask is faster however you need a size which is a power of 2.

I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.

The built in hashCode is cached, so it is fast. However it is not a great hashCode and has poor randomness for lower bit, and higher bit for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.

DEALING WITH MULTIPLE OCCURRENCES:

This is usually done with a counter for each key. This way you can have say 32767 duplicates (if you use short) or 2 billion (if you use int) duplicates of the same key/element.

answered Dec 9, 2015 at 22:35

Peter Lawrey

535k83 gold badges770 silver badges1.2k bronze badges

4 Comments

Dima Over a year ago

" You might want to implement your own hashing strategy, possibly a 64-bit one." Really? The guy doesn't know how to tell the array is full ...

Peter Lawrey Over a year ago

@Dima true, he/she only just "realised it is not as straight forward as I thought." and hasn't even asked about linear addressing vs link lists or trees of values, but will learn and someone who understands what that means might read the answer.

AlenEviLL Over a year ago

@PeterLawrey Thank you for answering. I think some things are out of scope of this exercise and my knowledge at the moment, but I will look into those topics in the future. Does HashMap also keep a counter for each element? So if I remove an element that has say 3 occurences I just decrease the counter until it is only 1 to remove it completely?

Peter Lawrey Over a year ago

@AlenEviLL HashMap keeps a counter for each key. It doesn't allow duplicates so you would need to keep a counter for the number of instances of a key. In any case it counts the number of unique keys.

Collectives™ on Stack Overflow

Implementing efficient data structure using Arrays only

2 Answers 2

2 Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related