6

In the program I'm currently working on, there's one part that's taking a bit too long. Basically, I have a list of Strings and one target phrase. As an example, let's say the target phrase is "inventory of finished goods". Now, after filtering out the stop word ("of"), I want to extract all Strings from the list that contain any of the three words "inventory", "finished", and "goods". Right now, I implemented the idea as follows:

String[] targetWords; // contains "inventory", "finished", and "goods"
ArrayList<String> extractedStrings = new ArrayList<String>();

for (int i = 0; i < listOfWords.size(); i++) {
    String[] words = listOfWords.get(i).split(" ");
    outerloop:
    for (int j = 0; j < words.length; j++) {
        for (int k = 0; k < targetWords.length; k++) {
            if (words[j].equalsIgnoreCase(targetWords[k])) {
                extractedStrings.add(listOfWords.get(i));
                break outerloop;
            }
        }
    }
}

The list contains over 100k words, and with this it takes roughly .4 to .8 seconds to complete the task for each target phrase. The thing is, I have a lot of these target phrases to process, and the seconds really add up. Thus, I was wondering if anyone knew of a more efficient way to complete this task? Thanks for the help in advance!

2
  • 2
    This is O(N^3). You can cut it down to O(N^2) by using a HashMap instead of the inner loop. But I'm puzzled by the loop on j. Why isn't your list of words already a list of words? Why do you have to split each item again? Commented Aug 9, 2013 at 0:43
  • Sorry, I should have named the variable better - listOfWords actually contains phrases, so I split the phrases to get each individual word in each phrase. Commented Aug 9, 2013 at 18:14

5 Answers

6

Your list of 100k words could be added (once) to a HashSet. Rather than iterating through your list, use wordSet.contains() - a HashSet gives constant-time performance for lookups, so the cost is not affected by the size of the list.
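A minimal sketch of that idea, with made-up data and assuming the list entries are single words (the comments below point out the real entries are phrases, in which case a different structure is needed):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetLookup {
    public static void main(String[] args) {
        // Hypothetical data standing in for the 100k-entry list
        List<String> listOfWords = Arrays.asList("inventory", "stock", "goods", "ledger");
        String[] targetWords = {"inventory", "finished", "goods"};

        // Build the set once (O(n)); lowercase everything to mimic equalsIgnoreCase
        Set<String> wordSet = new HashSet<>();
        for (String w : listOfWords) {
            wordSet.add(w.toLowerCase());
        }

        // Each subsequent lookup is then O(1) on average, per target word
        List<String> matched = new ArrayList<>();
        for (String target : targetWords) {
            if (wordSet.contains(target.toLowerCase())) {
                matched.add(target);
            }
        }
        System.out.println(matched); // [inventory, goods]
    }
}
```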


2 Comments

I think his "words" are phrases, not single words, so contains() wouldn't work to find a string within a string.
@denov OK, might need a more complex structure like HashMap<String,Set<String>> - key is to do the pre-processing once (instead of splitting constant data in each loop) and try to avoid iteration.
2

You can take your giant list of words, add them to a hash map once, and then, when each phrase comes in, just loop over the words in the phrase and check them against the hash map. Currently you are doing a linear search, and what I'm proposing would cut it down to a constant-time lookup.

The key is minimizing lookups. Using this technique you would be effectively indexing your giant list of words for fast lookups.
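A rough sketch of that indexing idea (the data and names here are illustrative, not from the question): each word maps to the phrases that contain it, so a target word becomes a single map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PhraseIndex {
    public static void main(String[] args) {
        // Hypothetical phrases standing in for the real 100k-entry list
        List<String> listOfPhrases = Arrays.asList(
                "inventory of goods", "annual report", "finished products");

        // Build the index once: word -> phrases containing that word
        Map<String, List<String>> index = new HashMap<>();
        for (String phrase : listOfPhrases) {
            for (String word : phrase.toLowerCase().split(" ")) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(phrase);
            }
        }

        // Each target phrase then costs a handful of constant-time lookups;
        // a Set avoids duplicates when several target words hit the same phrase
        Set<String> extracted = new LinkedHashSet<>();
        for (String target : new String[]{"inventory", "finished", "goods"}) {
            List<String> hits = index.get(target);
            if (hits != null) extracted.addAll(hits);
        }
        System.out.println(extracted); // [inventory of goods, finished products]
    }
}
```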

Comments

1

You are iterating through each of the elements of targetWords instead of checking for all words from targetWords simultaneously. In addition, you are splitting each entry in your list of words on every iteration without really needing to, which creates overhead.

I would suggest that you combine your targetWords into one (compiled) regular expression:

(?xi)  # turn on comments, use case insensitive matching
\b     # word boundary, i.e. start/end of string, whitespace
(      # begin of group containing 'inventory' or 'finished' or 'goods'
 inventory|finished|goods  # bar separates alternatives
)      # end of group
\b     # word boundary

Don't forget to double the backslashes in your regular expression string.

import java.util.regex.*;
...
Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b");
for (String singleString : listOfWords) {
  if (targetPattern.matcher(singleString).find()) {
    extractedStrings.add(singleString);
  }
}

If you are not satisfied with the speed of regular expressions (although regular expression engines are usually optimized for performance), you will need to roll your own high-speed multi-string search. The Aho–Corasick string matching algorithm is optimized for searching several fixed strings in text at once, but of course implementing this algorithm is quite some effort compared with simply creating a Pattern.

4 Comments

This is actually really clever. I like it! +1
I'm curious to see if this is any faster on very large lists, lists with very long strings, and when you need to do multiple lookups, compared to my answer using a HashMap. Somebody want to write a test??
@denov tested with War and Peace gutenberg.org/ebooks/2600 which contains 65007 lines. targetWords were as in the question. When timing it by simply checking currentTimeMillis, I get 350ms for the HashMap-based solution, 200ms for the regular expression solution, with the regex first (so the VM is still warming up). When switching HashMap before regex, its 390ms HashMap vs. 160ms regex. I did not measure the memory footprint (which should be higher for the HashMap solution as well).
@nd. thanks for coding up a test, did you look for a single needle or multiple needles in the haystack. yes, i'm sure the HashMap uses more memory.
1

I'm a little confused as to whether you want the whole phrase or just single words from listOfWords. If you are trying to get the string from listOfWords when one of your target words is in it, this should work for you.

    String[] targetWords= new String[]{"inventory", "finished", "goods"};
    List<String> listOfWords = new ArrayList<String>();

    // build lookup map
    Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>();
    for(String words : listOfWords) {
        for(String word : words.split(" ")) {
            if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>());
            lookupMap.get(word).add(words);
        }
    }

    // find phrases
    Set<String> extractedStrings = new HashSet<String>();
    for(String target : targetWords) {
        if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target));
    }

2 Comments

Sorry for the confusion, listOfWords contains phrases and I split them up to get the individual words so I can compare them against the words in the target phrase. If I'm not mistaken, wouldn't your solution potentially create duplicates for phrases that have more than one word matching? For example, assuming targetWords is the same, if I come across the phrase "inventory of goods", extractedStrings will end up containing that phrase twice since both the words "inventory" and "goods" are in the lookupMap? Should I just iterate through and remove all duplicates afterwards?
I updated my code so extractedStrings is a Set so you won't have dups.
0

I would try implementing it with an ExecutorService to parallelize the search across the list. http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html

For example with fixed thread pool size:

Executors.newFixedThreadPool(20);
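A sketch of what that could look like (the chunking, pool size, and data here are illustrative assumptions, not from the answer): split the list into slices, scan each slice in its own task, and merge the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical data standing in for the real list
        List<String> listOfPhrases = Arrays.asList(
                "inventory of goods", "annual report", "finished products", "goods received");
        Set<String> targets = new HashSet<>(Arrays.asList("inventory", "finished", "goods"));

        int nThreads = 4; // pool size is a tuning choice
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<List<String>>> futures = new ArrayList<>();

        // Split the list into chunks; each task scans one chunk independently
        int chunk = (listOfPhrases.size() + nThreads - 1) / nThreads;
        for (int start = 0; start < listOfPhrases.size(); start += chunk) {
            final List<String> slice = listOfPhrases.subList(
                    start, Math.min(start + chunk, listOfPhrases.size()));
            futures.add(pool.submit(() -> {
                List<String> hits = new ArrayList<>();
                for (String phrase : slice) {
                    for (String word : phrase.toLowerCase().split(" ")) {
                        if (targets.contains(word)) {
                            hits.add(phrase);
                            break; // one match is enough for this phrase
                        }
                    }
                }
                return hits;
            }));
        }

        // Collect results; get() blocks until each task finishes
        List<String> extracted = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            extracted.addAll(f.get());
        }
        pool.shutdown();
        System.out.println(extracted); // [inventory of goods, finished products, goods received]
    }
}
```

Note that threading only pays off once the per-phrase work is large enough to outweigh the task-submission overhead; for a single cheap membership test per word, the hash-based approaches above may already be fast enough on their own.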

Comments
