2

I have a dictionary which contains a big number of strings. Each string could have a range of 1 to 4 tokens (words). Example :

Dictionary :

  1. The Shawshank Redemption
  2. The Godfather \
  3. Pulp Fiction
  4. The Dark Knight
  5. Fight Club

Now I have a paragraph and I need to figure out how many strings in the para are part of the dictionary. Example, when the para below :

The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.

is run against the dictionary, I should be getting the ones in bold as the ones that are part of the dictionary.

How can I do this with the least dictionary calls.

Thanks

9
  • How massive is the dictionary supposed to be? 'How can I do this with the least dictionary calls' So its about lookup complexity, rather than memory usage - right? Any constraints regarding the programming language? Commented Dec 6, 2013 at 23:03
  • How is the paragraph given, one long string or a file? Commented Dec 6, 2013 at 23:05
  • The dictionary has a few million multi token strings. Ya, lookup complexity is the major issue. No constraints reg programming language. The para would be a blob of text Commented Dec 6, 2013 at 23:14
  • The Aho-Corasick string matching algorithm is what you want. It builds a trie, and searches are incredibly efficient. It's interesting to note that I typed the question title here into Google and found en.wikipedia.org/wiki/String_searching_algorithm as the second search result, where clicking on "Algorithms using a finite set of patterns" got me to Aho-Corasick. Learning how to use a search engine can be very helpful. Commented Dec 7, 2013 at 0:48
  • How long is expected length of your text? how long are dictionary elements themselves? Commented Dec 7, 2013 at 0:53

3 Answers 3

2

You might be better off using a Trie. A Trie is better suited to finding partial matches (i.e. as you search through the text of a paragraph) that are potentially what you're looking for, as opposed to making a bunch of calls to a dictionary that will mostly fail.

The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:

Basic Trie

If you use this (or some modification that has the tokenized words at each node instead of a letter), this would be the most efficient (at least that I know of) in terms of storage and retrieval; Storage because instead of storing the word "The" a couple thousand times in each Dict entry that has that word in the title (as is the case with movie titles), it would be stored once in one of the nodes right under the root. The next word, "Shawshank" would be in a child node, and then "redemption" would be in the next, with a total of 3 lookups; then you would move to the next phrase. If it fails, i.e. the phrase is only "The Shawshank Looper", then you fail after the same 3 lookups, and you move to the failed word, Looper (which as it happens, would also be a child node under the root, and you get a hit. This solution works assuming you're reading a paragraph without mashup movie names).

Using a hash table, you're going to have to split all the words, check the first word, and then while there's no match, keep appending words and checking if THAT phrase is in the dictionary, until you get a hit, or you reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would have as many lookups as there are words in the paragraph.

Sign up to request clarification or add additional context in comments.

5 Comments

My dictionary is massive. It will be impossible to make a trie out of it.
If author decides to go with a Trie, then probably optimizing it to a Suffix Tree may be a better idea?
@paypalcomp: Impossible? A trie will ultimately result in less storage space, and they're easy to build as you are initializing the dictionary. BrDaHa isn't suggesting you have a dictionary of words AND a trie, but JUST a trie, which is great for exactly what you need. A suffix tree being a version of a trie makes it, too, a good idea.
I agree, there are traditional space efficient algorithms (e.g. Suffix Arrays, Suffix Automata) that can be used in O(n + mk), i.e. construction is done in O(n) and for k patterns do lookup in O(m). Obviously some heuristics can be used with this to speed up. However, the most significant optimization, in my opinion, would be to filter the dictionary itself, i.e. consider only top 1000 elements instead of all 1,000,000.
@ile: Done correctly, once the trie is constructed the search speed is not affected by the number of search strings--only the length of the text being searched and the number of matches found. So filtering the dictionary is irrelevant.
1

This is not a complete answer, more like an extended-comment.

In literature it's called "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie based solutions will most probably perform poorly.

As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, etc. all of these fields need fast and reliable pattern matching, so there should be decent amount of research done.

I can imagine how Rabin-Karp with rolling-hash functions and some filters (Bloom filter) can be used in order to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak-hashes) and then actually verify, thus reducing number of verifications needed. Plus this should reduce the work done with the original dictionary itself, as you would store it's hashes, or other filters.

Comments

-2

In Python:

import re

movies={1:'The Shawshank Redemption', 2:'The Godfather', 3:'Pretty Woman', 4:'Pulp Fiction'}

text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'

repl_str ='(?P<title>' + '|'.join(['(?:%s)' %movie for movie in movies.values()]) + ')'

result = re.sub(repl_str, '<b>\g<title></b>',text)

Basically it consists of forming up a big substitution instruction string out of your dict values. I don't know whether regex and sub have a limitation in the size of the substitution instructions you give them though. You might want to check.

lai

8 Comments

Just saw that your dictionary is massive. You may loop over it in trunks of 20 movies with the above approach.
Another comment: you should sort your dict in descending alphabetical order before doing this. Otherwise you won't catch properly "The Godfather II".
But given a string from the para, like : "The Shawshank Redemption considered the.." you would have to make calls for all the 1,2,3 and 4 token combinations of the entire para (Bruteforce). Was wondering if there is a better way out.
example : if your para is " a,b,c,d" : the combinations for which you would make dictionary calls would be :a,b,c,d,ab,bc,cd,abc,bcd,abcd.
e.g. a Trie (mentioned in another solution) constructed from the whole dictionary will be smaller than the instruction in this solution.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.