
I am using Aho-Corasick to perform some string searches on documents. The original code uses a NumPy array to store, in an efficient way, the matches of each string in a string list:

import ahocorasick
import numpy as np

def ahocorasickFind(search_list, input):
    # build the automaton from the search terms
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    # collect the index of every search term found in the input text
    index_list = []
    for item in A.iter(input):
        print(item)
        index_list.append(item[1][0])

    # flag matched terms with 1; unmatched terms stay 0
    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_list = ['joão','maria','thiago'] # thousands of words in the real code
result = ahocorasickFind(search_list,'asdasdasd joão 1231231 thiago') # huge text in the real code
for index, element in enumerate(result):
    if(element == 1):
        print(search_list[index])

Using the above approach takes too much time and memory to iterate over and test (if == 1). So, how can I get the "original" strings found in the input text in a performant way?

1 Answer


If you are only interested in matching whole words (i.e. tokens separated by whitespace), rather than arbitrary substrings of the full text, it might be faster to use a set of words. Note, however, that this uses some additional memory. One straightforward way to replicate your behaviour would be:

words = set(text.split())   # text is the input document from your example
for w in search_list:
    if w in words:
        print(w)

or even shorter (but this changes the order of the results and drops duplicates from the search list):

for w in set(search_list).intersection(text.split()):
    print(w)
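
If you need the matches back in the original search_list order (and with its duplicates), one possible combination of the two versions above, sketched here rather than benchmarked, is to build the intersection once and then iterate over search_list:

# sketch: keep the speed of the set intersection, but report matches
# in the original search_list order (duplicates included)
found = set(search_list).intersection(text.split())
for w in search_list:
    if w in found:
        print(w)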

I've quickly tested it on a relatively large text object (143M characters, 23M words) and a rather short search_list object (606 words, of which 295 are unique), and the times I got are:

  • corasick: 14.5s
  • first version above: 4.6s
  • second version above: 2.6s (this speedup comes simply from doing roughly half the work, by skipping the duplicates)

However, the first version uses a (relatively) negligible amount of additional memory, while the other versions use quite a lot of it (for the data I was using, almost 2GB of additional memory).
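
For reference, here is a minimal sketch of how such a comparison could be timed; corpus.txt and the short search_list are placeholders, not the actual data behind the numbers above.

import time
import ahocorasick

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

text = open("corpus.txt", encoding="utf-8").read()   # placeholder corpus
search_list = ["joão", "maria", "thiago"]             # placeholder terms

def with_ahocorasick():
    # build the automaton and collect the distinct terms found in text
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()
    return {s for _, (_, s) in A.iter(text)}

def with_word_set():
    # split the text into words once, then test each search term against the set
    words = set(text.split())
    return [w for w in search_list if w in words]

time_it("ahocorasick", with_ahocorasick)
time_it("word set", with_word_set)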


2 Comments

Wow, I did not know that a simple for + in would be faster than Aho-Corasick... Could you test whether using a list comprehension affects the performance?
The main trick here is that the list of words (returned from split()) has been converted to a set, and looking up an object in a set is O(1), independently of how many words are in there. I don't know Aho-Corasick in detail, but I suspect it checks general substrings, which is why it can be slower. What do you mean by list comprehension? To replace the for loop? If so, it won't change anything: the list/set the loop above iterates over is "tiny" (a few hundred words), so it doesn't affect the runtime at all. Well over 99% of the runtime is spent applying split() and converting the resulting list to a set.
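
To make the set-lookup point concrete, a small self-contained sketch (the synthetic word list and its size are arbitrary): membership tests against a set take roughly constant time, while the same test against a list scans it linearly.

import timeit

words_list = [f"w{i}" for i in range(1_000_000)]  # arbitrary synthetic words
words_set = set(words_list)

# membership test against the set is O(1) on average...
print(timeit.timeit(lambda: "w999999" in words_set, number=1000))
# ...while the same test against the list scans it linearly, so it is far slower
print(timeit.timeit(lambda: "w999999" in words_list, number=1000))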
