6

In the program I'm currently working on, there's one part that's taking a bit too long. Basically, I have a list of Strings and one target phrase. As an example, let's say the target phrase is "inventory of finished goods". Now, after filtering out the stop word ("of"), I want to extract all Strings from the list that contain any of the three words "inventory", "finished", and "goods". Right now, I implemented the idea as follows:

String[] targetWords; // contains "inventory", "finished", and "goods"
ArrayList<String> extractedStrings = new ArrayList<String>();

for (int i = 0; i < listOfWords.size(); i++) {
    String[] words = listOfWords.get(i).split(" ");
    outerloop:
    for (int j = 0; j < words.length; j++) {
        for (int k = 0; k < targetWords.length; k++) {
            if (words[j].equalsIgnoreCase(targetWords[k])) {
                extractedStrings.add(listOfWords.get(i));
                break outerloop;
            }
        }
    }
}

The list contains over 100k words, and with this it takes roughly .4 to .8 seconds to complete the task for each target phrase. The thing is, I have a lot of these target phrases to process, and the seconds really add up. Thus, I was wondering if anyone knew of a more efficient way to complete this task? Thanks for the help in advance!

2
  • 2
    This is O(N^3). You can cut it down to O(N^2) by using a HashMap instead of the inner loop. But I'm puzzled by the loop on j. Why isn't your list of words already a list of words? Why do you have to split each item again? Commented Aug 9, 2013 at 0:43
  • Sorry, I should have named the variable better - listOfWords actually contains phrases, so I split the phrases to get each individual word in each phrase. Commented Aug 9, 2013 at 18:14

5 Answers

6

Your list of 100k words could be added (once) to a HashSet. Rather than iterating through your list, use wordSet.contains() - a HashSet gives constant-time performance for lookups, so the cost is not affected by the size of the list.
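A minimal sketch of that idea, with made-up data and assuming the list entries are single words (the comments below point out the real entries are phrases, in which case a different structure is needed):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetLookup {
    public static void main(String[] args) {
        // Hypothetical data standing in for the 100k-entry list
        List<String> listOfWords = Arrays.asList("inventory", "stock", "goods", "ledger");
        String[] targetWords = {"inventory", "finished", "goods"};

        // Build the set once (O(n)); lowercase everything to mimic equalsIgnoreCase
        Set<String> wordSet = new HashSet<>();
        for (String w : listOfWords) {
            wordSet.add(w.toLowerCase());
        }

        // Each subsequent lookup is then O(1) on average, per target word
        List<String> matched = new ArrayList<>();
        for (String target : targetWords) {
            if (wordSet.contains(target.toLowerCase())) {
                matched.add(target);
            }
        }
        System.out.println(matched); // [inventory, goods]
    }
}
```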


2 Comments

I think his "words" are phrases, not single words, so contains() wouldn't work to find a string within a string.
@denov OK, might need a more complex structure like HashMap<String,Set<String>> - key is to do the pre-processing once (instead of splitting constant data in each loop) and try to avoid iteration.
2

You can take your giant list of words, add them to a hash map once, and then, when each phrase comes in, just loop over the words in the phrase and check them against the hash map. Currently you are doing a linear search, and what I'm proposing would cut it down to a constant-time lookup.

The key is minimizing lookups. Using this technique you would be effectively indexing your giant list of words for fast lookups.
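A rough sketch of that indexing idea (the data and names here are illustrative, not from the question): each word maps to the phrases that contain it, so a target word becomes a single map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PhraseIndex {
    public static void main(String[] args) {
        // Hypothetical phrases standing in for the real 100k-entry list
        List<String> listOfPhrases = Arrays.asList(
                "inventory of goods", "annual report", "finished products");

        // Build the index once: word -> phrases containing that word
        Map<String, List<String>> index = new HashMap<>();
        for (String phrase : listOfPhrases) {
            for (String word : phrase.toLowerCase().split(" ")) {
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(phrase);
            }
        }

        // Each target phrase then costs a handful of constant-time lookups;
        // a Set avoids duplicates when several target words hit the same phrase
        Set<String> extracted = new LinkedHashSet<>();
        for (String target : new String[]{"inventory", "finished", "goods"}) {
            List<String> hits = index.get(target);
            if (hits != null) extracted.addAll(hits);
        }
        System.out.println(extracted); // [inventory of goods, finished products]
    }
}
```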

Comments

1

You are iterating through each of the elements of targetWords instead of checking for all words from targetWords simultaneously. In addition, you are splitting each entry in your list of words on every iteration without really needing to, which creates overhead.

I would suggest that you combine your targetWords into one (compiled) regular expression:

(?xi)  # turn on comments, use case insensitive matching
\b     # word boundary, i.e. start/end of string, whitespace
(      # begin of group containing 'inventory' or 'finished' or 'goods'
 inventory|finished|goods  # bar separates alternatives
)      # end of group
\b     # word boundary

Don't forget to double the backslashes in your regular expression string.

import java.util.regex.*;
...
Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b");
for (String singleString : listOfWords) {
  if (targetPattern.matcher(singleString).find()) {
    extractedStrings.add(singleString);
  }
}

If you are not satisfied with the speed of regular expressions (although regular expression engines are usually optimized for performance), you will need to roll your own high-speed multi-string search. The Aho–Corasick string matching algorithm is optimized for searching several fixed strings in text at once, but of course implementing this algorithm is quite some effort compared with simply creating a Pattern.

4 Comments

This is actually really clever. I like it! +1
I'm curious to see if this is any faster on very large lists, lists with very long strings, and when you need to do multiple lookups, compared to my answer using a HashMap. Somebody want to write a test??
@denov tested with War and Peace gutenberg.org/ebooks/2600 which contains 65007 lines. targetWords were as in the question. When timing it by simply checking currentTimeMillis, I get 350ms for the HashMap-based solution, 200ms for the regular expression solution, with the regex first (so the VM is still warming up). When switching HashMap before regex, its 390ms HashMap vs. 160ms regex. I did not measure the memory footprint (which should be higher for the HashMap solution as well).
@nd. thanks for coding up a test, did you look for a single needle or multiple needles in the haystack. yes, i'm sure the HashMap uses more memory.
1

I'm a little confused as to whether you want the whole phrase or just single words from listOfWords. If you are trying to get the string from listOfWords when one of your target words is in it, this should work for you.

    String[] targetWords= new String[]{"inventory", "finished", "goods"};
    List<String> listOfWords = new ArrayList<String>();

    // build lookup map
    Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>();
    for(String words : listOfWords) {
        for(String word : words.split(" ")) {
            if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>());
            lookupMap.get(word).add(words);
        }
    }

    // find phrases
    Set<String> extractedStrings = new HashSet<String>();
    for(String target : targetWords) {
        if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target));
    }

2 Comments

Sorry for the confusion, listOfWords contains phrases and I split them up to get the individual words so I can compare them against the words in the target phrase. If I'm not mistaken, wouldn't your solution potentially create duplicates for phrases that have more than one word matching? For example, assuming targetWords is the same, if I come across the phrase "inventory of goods", extractedStrings will end up containing that phrase twice since both the words "inventory" and "goods" are in the lookupMap? Should I just iterate through and remove all duplicates afterwards?
I updated my code so extractedStrings is a Set so you won't have dups.
0

I would try implementing it with an ExecutorService to parallelize the search across the list. http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html

For example with fixed thread pool size:

Executors.newFixedThreadPool(20);
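A sketch of what that could look like (the chunking, pool size, and data here are illustrative assumptions, not from the answer): split the list into slices, scan each slice in its own task, and merge the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical data standing in for the real list
        List<String> listOfPhrases = Arrays.asList(
                "inventory of goods", "annual report", "finished products", "goods received");
        Set<String> targets = new HashSet<>(Arrays.asList("inventory", "finished", "goods"));

        int nThreads = 4; // pool size is a tuning choice
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<List<String>>> futures = new ArrayList<>();

        // Split the list into chunks; each task scans one chunk independently
        int chunk = (listOfPhrases.size() + nThreads - 1) / nThreads;
        for (int start = 0; start < listOfPhrases.size(); start += chunk) {
            final List<String> slice = listOfPhrases.subList(
                    start, Math.min(start + chunk, listOfPhrases.size()));
            futures.add(pool.submit(() -> {
                List<String> hits = new ArrayList<>();
                for (String phrase : slice) {
                    for (String word : phrase.toLowerCase().split(" ")) {
                        if (targets.contains(word)) {
                            hits.add(phrase);
                            break; // one match is enough for this phrase
                        }
                    }
                }
                return hits;
            }));
        }

        // Collect results; get() blocks until each task finishes
        List<String> extracted = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            extracted.addAll(f.get());
        }
        pool.shutdown();
        System.out.println(extracted); // [inventory of goods, finished products, goods received]
    }
}
```

Note that threading only pays off once the per-phrase work is large enough to outweigh the task-submission overhead; for a single cheap membership test per word, the hash-based approaches above may already be fast enough on their own.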

Comments
