2

I'd like to create a method that searches a small String of text (usually no more than 256 characters) for any of about 20 different words. If it finds one of them in the text, regardless of case, it returns true.

The method will be executed quite a bit (not a crazy amount), so it has to be as efficient as possible. What do you suggest would be best here?

The 20 words do not change. They are static. But the text to scan does.

10
  • And where are those 20 words located? Commented Jul 31, 2013 at 11:37
  • I don't understand your requirement clearly. Do you mean that you want to check whether any of the 20 given words is present in the String? Commented Jul 31, 2013 at 11:37
  • 3
    Have you tried any regular expressions? Commented Jul 31, 2013 at 11:38
  • Agree with @Fraser. Regular expression is the most efficient. Compile once, use many times. Commented Jul 31, 2013 at 11:39
  • 2
    A regular expression is compiled to a finite state machine that matches the data in one pass. Commented Jul 31, 2013 at 11:54

8 Answers

5

I'd suggest: add all the words in the input text to a Set - it's only 256 characters after all, and adding them is an O(n) operation.

After that you can test each of the 20 or so words for membership using the contains() operation of the Set, which is O(1).
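A minimal sketch of this idea, assuming the input can be split into discrete words on non-alphanumeric characters. The class name SetWordScanner and the three keywords are hypothetical placeholders for the real list of about 20 words:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SetWordScanner {
    // Hypothetical keyword list, stored in lowercase.
    private static final List<String> KEYWORDS = Arrays.asList("apple", "banana", "cherry");

    static boolean containsKeyword(String text) {
        // Lowercase once, then split on runs of non-letter/non-digit characters
        // so that punctuation does not stick to the words.
        Set<String> tokens = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        // Each membership test is an expected O(1) hash lookup.
        for (String keyword : KEYWORDS) {
            if (tokens.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that this only finds whole words: a keyword embedded in a longer word (for example "apple" inside "pineapples") is not reported.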


3

Since the 20 words to search for don't change, one of the fastest ways to look for them is to compile a regular expression that matches any of them and reuse it on different inputs. The complexity of matching a regular expression against a given string is linear in the string length for simple regular expressions that don't require backtracking. In your case the length is bounded, so it's O(1).
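As a rough sketch of the compile-once approach, assuming the keywords are plain literals joined into one alternation. The class name RegexWordScanner and the keyword list are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexWordScanner {
    // Hypothetical keyword list; the real one would hold the ~20 static words.
    private static final List<String> KEYWORDS = Arrays.asList("apple", "banana", "cherry");

    // Compiled once when the class loads and reused for every input.
    // Pattern.quote keeps any regex metacharacters in a keyword literal.
    private static final Pattern ANY_KEYWORD = Pattern.compile(
            KEYWORDS.stream().map(Pattern::quote).collect(Collectors.joining("|")),
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean containsKeyword(String text) {
        // find() looks for a match anywhere in the text, not only at the start.
        return ANY_KEYWORD.matcher(text).find();
    }
}
```

This version matches keywords as substrings. If only whole words should count, wrapping the alternation in `\b(?:...)\b` is one option.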

5 Comments

No, it won't be O(1). indexOf or a hash lookup would be better than using a regex, even if the regex stays the same.
Answers that suggest a set assume that the input string is composed of discrete words. The way I read the OP, that may or may not be the case. Perhaps the OP can clarify?
@Anirudh, since everything here has an upper bound (20 words, 256 chars, ...) the execution time of any decent algorithm will also be bounded by a constant, therefore, O(1). Why do you say indexOf or a hash table would be better, better in what sense?
@Joni we should test it instead..regex vs hash lookup vs indexof
I'd agree with @andy256, it would be nice for the OP to answer that - I wrote my answer making that assumption :)
2

The String class already has lots of methods to do these sorts of things. For example, the indexOf method will solve your problem:

String str = "blahblahtestblah";
int result = str.indexOf("test");

result will contain -1 if the string does not contain the word "test". I'm not sure if this is efficient enough for you but I would start here as it's been implemented already!


2

Assuming these 20 words are in a Set<String> and all are lowercase, then it is as easy as:

public final boolean containsWord(final String input)
{
    final String s = input.toLowerCase();
    for (final String word: wordSet)
        if (s.indexOf(word) != -1)
            return true;
    return false;
}
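A possible variant of the loop above, sketched under the assumption that avoiding the lowercase copy of the input matters: String.regionMatches with ignoreCase set to true compares in place. The class name RegionMatchScanner and the keywords are hypothetical:

```java
final class RegionMatchScanner {
    // Hypothetical keyword list; case does not matter here.
    private static final String[] KEYWORDS = {"apple", "banana", "cherry"};

    static boolean containsKeyword(String text) {
        for (String keyword : KEYWORDS) {
            // Last start index at which the keyword could still fit.
            int last = text.length() - keyword.length();
            for (int start = 0; start <= last; start++) {
                // Case-insensitive comparison without allocating a lowercase copy.
                if (text.regionMatches(true, start, keyword, 0, keyword.length())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Whether this beats toLowerCase() plus indexOf() on 256-character inputs is something to measure rather than assume.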


1

If you want to search for a number of different targets simultaneously, then the Rabin-Karp algorithm is a possibility. It is especially efficient if there are only a few different word lengths in your list of 20 targets. One single pass through the string will find all the matches of a given length.
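One way this could look, as a sketch rather than a tuned implementation: keywords are grouped by length, and for each length a rolling hash slides over the text once. The class name RabinKarpScanner, the base 31, and the keyword list are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class RabinKarpScanner {
    private static final int BASE = 31; // arbitrary hash base; int overflow acts as the modulus
    // Hypothetical lowercase keyword list.
    private static final String[] KEYWORDS = {"apple", "banana", "cherry", "grape"};

    // keyword length -> (hash -> keywords of that length with that hash)
    private static final Map<Integer, Map<Integer, List<String>>> BY_LENGTH = new HashMap<>();

    static {
        for (String keyword : KEYWORDS) {
            BY_LENGTH.computeIfAbsent(keyword.length(), k -> new HashMap<>())
                     .computeIfAbsent(hash(keyword, 0, keyword.length()), k -> new ArrayList<>())
                     .add(keyword);
        }
    }

    // Polynomial hash of s[from, from + len).
    private static int hash(String s, int from, int len) {
        int h = 0;
        for (int i = from; i < from + len; i++) {
            h = h * BASE + s.charAt(i);
        }
        return h;
    }

    static boolean containsKeyword(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<Integer, Map<Integer, List<String>>> group : BY_LENGTH.entrySet()) {
            int len = group.getKey();
            if (len > s.length()) {
                continue;
            }
            int highPower = 1; // BASE^(len - 1), used to drop the leading character
            for (int i = 1; i < len; i++) {
                highPower *= BASE;
            }
            int h = hash(s, 0, len);
            for (int start = 0; ; start++) {
                List<String> candidates = group.getValue().get(h);
                if (candidates != null) {
                    // Hashes can collide, so confirm with a real comparison.
                    for (String candidate : candidates) {
                        if (s.startsWith(candidate, start)) {
                            return true;
                        }
                    }
                }
                if (start + len >= s.length()) {
                    break;
                }
                // Roll the window: remove s[start], append s[start + len].
                h = (h - s.charAt(start) * highPower) * BASE + s.charAt(start + len);
            }
        }
        return false;
    }
}
```

The text is scanned once per distinct keyword length, which is why few distinct lengths make this approach attractive.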


0

I'd do the following:

String longStr; //the string to search in
ArrayList<String> words; //the words to check (note: contains() is case-sensitive)

Iterator<String> iter = words.iterator();
while(iter.hasNext())
{
    if(longStr.contains(iter.next()))
        return true;    
}
return false;

2 Comments

why not use for loop!
while loop is as efficient as for loop.
0

You can put all the words into a List, sort it, and use Collections.binarySearch(...). You will lose some time on sorting, but each binarySearch lookup is O(log n).
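A small sketch of one reading of this suggestion, assuming the sorted List holds the static keywords (sorted once) and the input is split into discrete words. The class name BinarySearchScanner and the keywords are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class BinarySearchScanner {
    // Hypothetical lowercase keyword list, deliberately unsorted here.
    private static final List<String> SORTED_KEYWORDS =
            new ArrayList<>(Arrays.asList("cherry", "apple", "banana"));

    static {
        // Sort once up front; Collections.binarySearch requires a sorted list.
        Collections.sort(SORTED_KEYWORDS);
    }

    static boolean containsKeyword(String text) {
        // Assumes words are separated by non-letter/non-digit characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // A non-negative result means the token was found in the sorted list.
            if (Collections.binarySearch(SORTED_KEYWORDS, token) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

Because the keywords never change, the sorting cost is paid only once, at class initialization.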


0

OK. Thanks for answering and commenting, everybody. I realise that the question I asked can have broad and varied answers. This is what I ended up using, because performance was very important and the standard Collections just wouldn't cut the mustard.

I used a "Patricia Trie", which is a very powerful and elegant data structure offering low memory overhead and extremely fast search speeds.

If anyone is interested, there is a video here briefly explaining how a Patricia Trie works. You will realise why it's so performant after watching. Also there is a Java implementation of the data structure on github here.
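To give a feel for the idea, here is a sketch of a plain (uncompressed) trie, not the Patricia Trie from the linked implementation; a Patricia Trie additionally merges chains of single-child nodes to save memory. The class name TrieScanner and the keywords are hypothetical:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TrieScanner {
    // One node per character; endOfWord marks a complete keyword.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private static final Node ROOT = new Node();

    static {
        // Hypothetical lowercase keyword list, inserted once at class load.
        for (String keyword : new String[] {"apple", "banana", "cherry"}) {
            Node node = ROOT;
            for (char c : keyword.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.endOfWord = true;
        }
    }

    static boolean containsKeyword(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        // Try to walk the trie from every start position in the text.
        for (int start = 0; start < s.length(); start++) {
            Node node = ROOT;
            for (int i = start; i < s.length(); i++) {
                node = node.children.get(s.charAt(i));
                if (node == null) {
                    break; // no keyword continues with this character
                }
                if (node.endOfWord) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Each start position is abandoned as soon as no keyword shares the prefix read so far, so most positions cost only one or two lookups.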

