Efficient String text search

Question

I'd like to create a method which searches a small String of text (usually no more than 256 characters) for the existence of any of about 20 different words. If it finds one in the text, regardless of case, it returns true.

The method will be executed quite a bit (not a crazy amount), so it has to be as efficient as possible. What do you suggest would be best here?

The 20 words do not change. They are static. But the text to scan does.

I don't understand the requirement clearly. Do you mean you want to check whether a given word is present in a String that contains 20 words?
– Jayesh, Commented Jul 31, 2013 at 11:37
Agree with @Fraser. Regular expression is the most efficient. Compile once, use many times.
– andy256, Commented Jul 31, 2013 at 11:39
A regular expression is compiled to a finite state machine that matches the data in one pass.
– andy256, Commented Jul 31, 2013 at 11:54

Óscar López · Accepted Answer · 2013-07-31 11:37:46Z

5

I'd suggest: add all the words in the input text to a Set - it's only 256 characters after all, and adding them is an O(n) operation.

After that you can test each of the 20 or so words for membership using the contains() operation of the Set, which is O(1).
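A minimal sketch of this idea follows. It assumes the text can be split into discrete words on non-letter characters and that whole-word matches are what is wanted. The class name, method name and the three keywords are made up for illustration; they are not from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TokenSetSearch {
    // Hypothetical keyword list; the real 20 words would go here, in lower case.
    private static final String[] KEYWORDS = {"error", "timeout", "refused"};

    static boolean containsAnyKeyword(String text) {
        // Lower-case the text, then split on runs of non-letter characters.
        // Assumption: only whole words count as matches.
        Set<String> tokens = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")));
        for (String keyword : KEYWORDS) {
            if (tokens.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch a keyword embedded in a longer word (for example "timeouts") is not reported, which may or may not be what the question wants.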


Joni · Accepted Answer · 2013-07-31 11:40:29Z

3

Since the 20 words to search for don't change, one of the fastest ways to look for them is to compile a regular expression that matches them and reuse it on different inputs. The complexity of matching a regular expression against a given string is linear in the string length for simple regular expressions that don't require backtracking. In your case the length is bounded, so it's O(1).
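As a rough illustration, the pattern can be compiled once into a static field and reused for every call. The keywords below are placeholders, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexSearch {
    // Hypothetical keywords joined into one alternation, compiled once.
    // Words containing regex metacharacters should be passed through Pattern.quote first.
    private static final Pattern KEYWORDS =
            Pattern.compile("error|timeout|refused", Pattern.CASE_INSENSITIVE);

    static boolean containsAnyKeyword(String text) {
        // find() succeeds if any keyword occurs anywhere in the text, including inside longer words.
        return KEYWORDS.matcher(text).find();
    }
}
```

If only whole words should match, wrapping the alternation as `\b(?:...)\b` is one option.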


5 Comments

Anirudha Over a year ago

No! It won't be O(1)... indexOf or a hash would be better than using a regex, even if the regex remains the same.

andy256 Over a year ago

Answers that suggest a set assume that the input string is composed of discrete words. The way I read the OP, that may or may not be the case. Perhaps the OP can clarify?

Joni Over a year ago

@Anirudh, since everything here has an upper bound (20 words, 256 chars, ...) the execution time of any decent algorithm will also be bounded by a constant, therefore, O(1). Why do you say indexOf or a hash table would be better, better in what sense?

Anirudha Over a year ago

@Joni we should test it instead: regex vs hash lookup vs indexOf.

devrobf Over a year ago

I'd agree with @andy256, it would be nice for the OP to answer that - I wrote my answer making that assumption :)

devrobf · Accepted Answer · 2013-07-31 11:37:53Z

2

The String class already has lots of methods to do these sorts of things. For example, the indexOf method will solve your problem:

String str = "blahblahtestblah";
int result = str.indexOf("test");

result will be -1 if the string does not contain the word "test". Note that indexOf is case-sensitive, so lower-case the text first if case must be ignored. I'm not sure if this is efficient enough for you, but I would start here as it's been implemented already!


fge · Accepted Answer · 2013-07-31 11:39:22Z

2

Assuming these 20 words are in a Set<String> and all are lowercase, then it is as easy as:

public final boolean containsWord(final String input)
{
    final String s = input.toLowerCase();
    for (final String word: wordSet)
        if (s.indexOf(word) != -1)
            return true;
    return false;
}


rossum · Accepted Answer · 2013-07-31 11:51:08Z

1

If you want to search for a number of different targets simultaneously, then the Rabin-Karp algorithm is a possibility. It is especially efficient if there are only a few different word lengths in your list of 20 targets. One single pass through the string will find all the matches of a given length.
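A sketch of how that could look: targets are grouped by length, and for each length one rolling-hash pass is made over the text, with a string comparison to rule out hash collisions. The class name, the multiplier 31 and the use of wrapping int arithmetic are illustrative choices, not something this answer prescribes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class RabinKarpSearch {
    private static final int BASE = 31; // arbitrary multiplier; arithmetic wraps on int overflow

    // target length -> (hash -> targets with that length and hash)
    private final Map<Integer, Map<Integer, Set<String>>> byLength = new HashMap<>();

    RabinKarpSearch(String... targets) {
        for (String t : targets) {
            String word = t.toLowerCase(Locale.ROOT);
            if (word.isEmpty()) continue;
            byLength.computeIfAbsent(word.length(), k -> new HashMap<>())
                    .computeIfAbsent(hash(word, word.length()), k -> new HashSet<>())
                    .add(word);
        }
    }

    // Polynomial hash of the first len characters of s.
    private static int hash(String s, int len) {
        int h = 0;
        for (int i = 0; i < len; i++) h = h * BASE + s.charAt(i);
        return h;
    }

    boolean containsAny(String input) {
        String text = input.toLowerCase(Locale.ROOT);
        for (Map.Entry<Integer, Map<Integer, Set<String>>> e : byLength.entrySet()) {
            int len = e.getKey();
            if (len > text.length()) continue;
            Map<Integer, Set<String>> hashes = e.getValue();
            int high = 1; // BASE^(len-1), weight of the character leaving the window
            for (int i = 1; i < len; i++) high *= BASE;
            int h = hash(text, len);
            for (int start = 0; ; start++) {
                Set<String> candidates = hashes.get(h);
                // Verify the actual substring to rule out hash collisions.
                if (candidates != null
                        && candidates.contains(text.substring(start, start + len))) {
                    return true;
                }
                if (start + len >= text.length()) break;
                // Slide the window: drop text[start], append text[start + len].
                h = (h - text.charAt(start) * high) * BASE + text.charAt(start + len);
            }
        }
        return false;
    }
}
```

The number of passes equals the number of distinct target lengths, which is why few distinct lengths help.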


MaVVamaldo · Accepted Answer · 2013-07-31 11:42:20Z

0

I'd do the following:

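import java.util.Iterator;
import java.util.List;

// longStr: the string to search in; words: the lower-case words to check
static boolean containsAny(String longStr, List<String> words)
{
    String lower = longStr.toLowerCase(); // the question asks for a case-insensitive match
    Iterator<String> iter = words.iterator();
    while (iter.hasNext())
    {
        if (lower.contains(iter.next()))
            return true;
    }
    return false;
}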


2 Comments

Anirudha Over a year ago

Why not use a for loop?

MaVVamaldo Over a year ago

A while loop is as efficient as a for loop.

Michael Cheremuhin · Accepted Answer · 2013-07-31 11:47:34Z

0

You can put all the words into a List, sort it and use Collections.binarySearch(...). You will lose time on sorting, but the binarySearch is O(log n).
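One possible reading of this suggestion, sketched under assumptions: since the 20 words never change, they are sorted once, and each word of the input text is then looked up with Collections.binarySearch. The keywords, the class name and the splitting on non-letter characters are illustrative and not part of the answer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class SortedListSearch {
    // Hypothetical keywords in lower case; sorted once because the list never changes.
    private static final List<String> KEYWORDS = sorted("timeout", "error", "refused");

    private static List<String> sorted(String... words) {
        List<String> list = Arrays.asList(words);
        Collections.sort(list); // binarySearch requires a sorted list
        return list;
    }

    static boolean containsAnyKeyword(String text) {
        // Assumption: the text consists of discrete words separated by non-letters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (Collections.binarySearch(KEYWORDS, token) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```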


Eurig Jones · Accepted Answer · 2013-07-31 15:48:04Z

0

Ok. Thanks for answering and commenting everybody. I realise that the question I asked can have broad and varied answers. But this is what I ended up using because the performance was very important so using standard Collections just won't cut the mustard.

I used a "Patricia Trie", which is a very powerful and elegant data structure offering low memory overhead and extremely fast search speeds.

If anyone is interested, there is a video here briefly explaining how a Patricia Trie works. You will realise why it's so performant after watching. Also there is a Java implementation of the data structure on github here.
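To give a feel for the lookup, here is a sketch of a plain (uncompressed) trie, not the Patricia variant and not the linked implementation. A Patricia trie additionally merges chains of single-child nodes to save memory. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class KeywordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    KeywordTrie(String... keywords) {
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toLowerCase(Locale.ROOT).toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.endOfWord = true;
        }
    }

    boolean containsAny(String input) {
        String text = input.toLowerCase(Locale.ROOT);
        // From every start position, walk down the trie until a keyword ends or the path dies.
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;
                if (node.endOfWord) return true;
            }
        }
        return false;
    }
}
```

Each start position costs at most the length of the longest keyword, regardless of how many keywords are stored.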

edited Jul 31, 2013 at 15:48

answered Jul 31, 2013 at 15:42

