An algorithm to search for words in an array of letters

Question

So I have an Array, and I have to search for words

The Array:

        0   1   2   3   4   5   6   7   8   9   10  11
text    g   t   c   a   n   d   l   e   t   j   a   q

The Keys:

2 can
2 candle
3 a
3 an
3 and
6 let
10 a

The number is an offset from the start of the array being searched, and the string is a word from the dictionary that was found at that offset. Note that several words can start at the same offset, and the same word can be found in multiple places. Note also that words can overlap.

This is the code I have written:

public ArrayList<Location> findWords(String[] dictionary, String text) {
    int keyLength = text.length();
    int dtLength = dictionary.length;

    ArrayList<Location> results;
    results = new ArrayList<>();

    for (int k = 0; k < keyLength; k++) {
        for (int d = 0; d < dtLength; d++) {
            if (dthasKey(dictionary[d], text, k)) {
                Location loc = new Location(k, dictionary[d]);
                results.add(loc);
            }
        }
    }
    return results;
}

private boolean dthasKey(String key, String text, int pos) {
    for (int i = 0; i < key.length(); i++) {
        if (key.length() >= text.length() - pos)
            return false;
        while (key.charAt(i) != text.charAt(pos + i)) {
            return false;
        }
    }
    return true;
}

I was wondering if there is a better solution to this problem. If you guys may also provide a worst time complexity, that would be great. The one I have written is: O(k*n*m) where m is the size of the dictionary, n is the size of the text, and k is the length of the longest word.

Have the dictionary as HashSet<String> and so you have O(k*n) — Dmitrii Bychenko
– Dmitrii Bychenko, Commented Jun 1, 2015 at 13:47
Your worst time complexity if going to be to search for every possible word through the entire array (e.g. entire string, entire string minus 1 letter, then two, etc...) and iterate through each possible combination checking if its a word. — Sh4d0wsPlyr
– Sh4d0wsPlyr, Commented Jun 1, 2015 at 13:48
I donot know the complexity of Regex, but I think you can try it docs.oracle.com/javase/8/docs/api/java/util/regex/… — schlagi123
– schlagi123, Commented Jun 1, 2015 at 13:50
The standard solution is the Aho-Corasick string matching algorithm, which builds an automaton from the dictionary (a one-time cost), and then can quickly find all words in a string that you pass to it. A Google search reveals many Java implementations. — Jim Mischel
– Jim Mischel, Commented Jun 1, 2015 at 15:11
The Aho-Corasick search algorithm has time complexity O(n + z), where n is the length of the text to be searched, and z is the number of matches found. That's after the tree is constructed. Construction of the automaton is O(n), but with a reasonably high constant. However, it's a one-time cost. That is, you construct the automaton once from the dictionary, and use that automaton to search multiple strings or documents. — Jim Mischel
– Jim Mischel, Commented Jun 1, 2015 at 15:24

Jim Mischel · Accepted Answer · 2015-06-01 23:41:35Z

3

The standard solution to your problem is to use the Aho-Corasick string matching algorithm, which builds an automaton from the dictionary and then can quickly find all words in a string that you pass to it. A Google search reveals many Java implementations.

Building the automaton is O(n), where n is the number of characters in all words of the dictionary. But it's a one-time cost. You can use that automaton to search multiple documents for the words.

Searching documents for words is O(m + z), where m is the number of characters in the document, and z is the number of matches found.

I don't know if Aho-Corasick is the fastest algorithm, but it is very fast. And that there are existing Java implementations would be a big plus. But it's not especially difficult to implement. The original paper, Efficient String Matching: an Aid to Bibliographic Search, is very readable although it'll probably take a couple of iterations of reading, pondering, and reading again before it "clicks." And the pseudocode examples are detailed enough that you can use them as the basis of an implementation. I created a C# implementation for an article using that document as my only reference.

answered Jun 1, 2015 at 23:41

Jim Mischel

135k25 gold badges197 silver badges377 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Mohammad Sajjad Over a year ago

Thank you @Jim. This really helps :) O(n) should be sufficient :D

Ofek Ron · Accepted Answer · 2015-06-01 13:51:41Z

0

You can create an automaton for each of the words (that accepts only that word) and then run the given text through all the automatons simolteanously, this will end up O(m*k^2+n)

answered Jun 1, 2015 at 13:51

Ofek Ron

8,60813 gold badges61 silver badges106 bronze badges

Collectives™ on Stack Overflow

An algorithm to search for words in an array of letters

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related