0

I am currently working in a java application using a lot of strings (+2000). I want to store these strings in an proper structure, so when i want to store a new string, i can check in a fast way if a same string was already there. If no same string was in the structure, i proceed to store the new one (basically store without repeating strings.).

//PSEUDOCODE
private ?????? myCollectionOfStrings;

public void store_If_Not_Exist(String aNewString){
    if (!exist_in_Collection(aNewString)){ //this must be fast.
        store_in_Collection(aNewString);
    }
}

I am currently using a naive implementation, but i know that is really inefficient:

private List<String> myCollectionOfStrings;

public void store_If_Not_Exist(String aNewString){
    boolean existInCollection = false;

    for (String s: myCollectionOfStrings){
        if (s.equals(aNewString)){
            existInCollection = true;
            break;
        }
    }

    if(!existInCollection)
        store_in_Collection(aNewString);
}

The question is : What kind of method/Structure/Algorithm can i use to store the strings, so the check for existence can be implemented in a fast way? Maybe a Trie Tree, or a HashMap???. Thanks!

5
  • 4
    Use a Set<String>. But anything that looks up by hashcode is relatively efficient. 2000 is not that large. I assume, of course, that you are looking for a direct match and not things like stemming, plurals, etc. Actually, using Set would allow by-passing the check, since only one instance will be present. Commented Apr 22, 2016 at 16:45
  • 5
    You are looking for a Set data structure. In Java, HashSet. It has O(1) lookup time for an element. Commented Apr 22, 2016 at 16:45
  • Use a HashSet, which is very fast Commented Apr 22, 2016 at 16:58
  • In most environments, 2000 strings isn't a lot. I would concentrate on the functionality rather than performance: you need a collection that stores different items only. And that's a Set. Commented Apr 22, 2016 at 18:16
  • Thanks for responses!!! , i will use a HashSet for my problem. Commented Apr 22, 2016 at 22:10

1 Answer 1

2

If maintaining the words in alphabetical order is not important, then just use a HashSet. It allows you to retrieve any element in O(1) and you can just add the word to the set without worrying about creating duplicates.

The only problem with hash collections is that the don't maintain a natural order when you iterate over them. In other words, a HashSet will not print your words in alphabetical order.

If order is critical to your application, my suggestion is that you use either a TreeMap or a Trie. They both share some characteristics and the basic structure, but a Trie is optimized for strings.

If you don't want to overcomplicate things, use the TreeMap that is part of the collections framework.

But if you want to go the extra mile in your path to efficiency, then the Data structure that you are looking for is a Trie.

https://en.wikipedia.org/wiki/Trie

In summary, a Trie is a data Structure that allows you to store strings in alphabetical order. It is very powerful because it allows you detect that a word is missing very quickly.

Imagine that you want to check for the existence of the word "foo", and if it is not in your Tree, you want to add it.

As you can see in the wikipedia article, the root node of the Trie contains an empty String. Your first action to determine if the word foo is in the Trie would be to check if the root node has a child node with the string "f". If it does NOT, you already know that the word is not in your Trie, and you only did an operation.

If, on the other hand, the root node has a child with the string "f", then you have to check if this node has a child with the string "fo", if it doesn't, your word is not in the trie. If it does, you finally check if the "fo" node has a child named "foo".

To sum up, a Trie is exactly what you are looking for and it will allow you to efficiently insert and check for the existence of words while maintaining its natural order.

In this forum post you can see an implementation of a trie so you don't have to reinvent the wheel.

https://community.oracle.com/thread/2070706

TO SUM UP:

  • I don't care about maintaining a particular order: use a HashSet
  • I care about maintaining a the words in alphabetical order and I want a simple solution, even if it is not the most efficient: use a TreeMap
  • I need to maintain alphabetical order and performance is critical: use a Trie.
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks!!, that was pretty informative. I don't care about order, so i will use the HashSet.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.