2

So I have an Array, and I have to search for words

The Array:

        0   1   2   3   4   5   6   7   8   9   10  11
text    g   t   c   a   n   d   l   e   t   j   a   q

The Keys:

2 can
2 candle
3 a
3 an
3 and
6 let
10 a

The number is an offset from the start of the array being searched, and the string is a word from the dictionary that was found at that offset. Note that several words can start at the same offset, and the same word can be found in multiple places. Note also that words can overlap.

This is the code I have written:

public ArrayList<Location> findWords(String[] dictionary, String text) {
    int keyLength = text.length();
    int dtLength = dictionary.length;

    ArrayList<Location> results;
    results = new ArrayList<>();

    for (int k = 0; k < keyLength; k++) {
        for (int d = 0; d < dtLength; d++) {
            if (dthasKey(dictionary[d], text, k)) {
                Location loc = new Location(k, dictionary[d]);
                results.add(loc);
            }
        }
    }
    return results;
}

private boolean dthasKey(String key, String text, int pos) {
    for (int i = 0; i < key.length(); i++) {
        if (key.length() >= text.length() - pos)
            return false;
        while (key.charAt(i) != text.charAt(pos + i)) {
            return false;
        }
    }
    return true;
}

I was wondering if there is a better solution to this problem. If you guys may also provide a worst time complexity, that would be great. The one I have written is: O(k*n*m) where m is the size of the dictionary, n is the size of the text, and k is the length of the longest word.

11
  • 2
    Have the dictionary as HashSet<String> and so you have O(k*n) Commented Jun 1, 2015 at 13:47
  • Your worst time complexity if going to be to search for every possible word through the entire array (e.g. entire string, entire string minus 1 letter, then two, etc...) and iterate through each possible combination checking if its a word. Commented Jun 1, 2015 at 13:48
  • I donot know the complexity of Regex, but I think you can try it docs.oracle.com/javase/8/docs/api/java/util/regex/… Commented Jun 1, 2015 at 13:50
  • 1
    The standard solution is the Aho-Corasick string matching algorithm, which builds an automaton from the dictionary (a one-time cost), and then can quickly find all words in a string that you pass to it. A Google search reveals many Java implementations. Commented Jun 1, 2015 at 15:11
  • 1
    The Aho-Corasick search algorithm has time complexity O(n + z), where n is the length of the text to be searched, and z is the number of matches found. That's after the tree is constructed. Construction of the automaton is O(n), but with a reasonably high constant. However, it's a one-time cost. That is, you construct the automaton once from the dictionary, and use that automaton to search multiple strings or documents. Commented Jun 1, 2015 at 15:24

2 Answers 2

3

The standard solution to your problem is to use the Aho-Corasick string matching algorithm, which builds an automaton from the dictionary and then can quickly find all words in a string that you pass to it. A Google search reveals many Java implementations.

Building the automaton is O(n), where n is the number of characters in all words of the dictionary. But it's a one-time cost. You can use that automaton to search multiple documents for the words.

Searching documents for words is O(m + z), where m is the number of characters in the document, and z is the number of matches found.

I don't know if Aho-Corasick is the fastest algorithm, but it is very fast. And that there are existing Java implementations would be a big plus. But it's not especially difficult to implement. The original paper, Efficient String Matching: an Aid to Bibliographic Search, is very readable although it'll probably take a couple of iterations of reading, pondering, and reading again before it "clicks." And the pseudocode examples are detailed enough that you can use them as the basis of an implementation. I created a C# implementation for an article using that document as my only reference.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you @Jim. This really helps :) O(n) should be sufficient :D
0

You can create an automaton for each of the words (that accepts only that word) and then run the given text through all the automatons simolteanously, this will end up O(m*k^2+n)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.