
I am using Aho-Corasick to perform some string searches on documents. The original code uses a NumPy array to store, in an efficient way, the matches of each string in a string list:

import ahocorasick
import numpy as np

def ahocorasickFind(search_list, input):
    # build the automaton from the search terms
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()

    # collect the index of every search term found in the input text
    index_list = []
    for item in A.iter(input):
        print(item)
        index_list.append(item[1][0])

    # flag matched terms with 1; unmatched terms stay 0
    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_list = ['joão','maria','thiago'] # thousands of words in the real code
result = ahocorasickFind(search_list,'asdasdasd joão 1231231 thiago') # huge text in the real code
for index, element in enumerate(result):
    if(element == 1):
        print(search_list[index])

Using the above approach takes too much time and memory to iterate over and test (if == 1). So, how can I get the "original" strings found in the input text in a performant way?

1 Answer


If you are only interested in matching whole words (i.e. tokens separated by whitespace), rather than arbitrary substrings of the full text, it might be faster to use a set of words. Note, however, that this uses some additional memory. One straightforward way to replicate your behaviour would be:

words = set(text.split())   # text is the input document from your example
for w in search_list:
    if w in words:
        print(w)

or even shorter (but this changes the order of the results and drops duplicates from the search list):

for w in set(search_list).intersection(text.split()):
    print(w)
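
If you need the matches back in the original search_list order (and with its duplicates), one possible combination of the two versions above, sketched here rather than benchmarked, is to build the intersection once and then iterate over search_list:

# sketch: keep the speed of the set intersection, but report matches
# in the original search_list order (duplicates included)
found = set(search_list).intersection(text.split())
for w in search_list:
    if w in found:
        print(w)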

I've quickly tested it on a relatively large text object (143M characters, 23M words) and a rather short search_list object (606 words, of which 295 are unique), and the times I got are:

  • corasick: 14.5s
  • first version above: 4.6s
  • second version above: 2.6s (this speedup comes simply from doing roughly half the work, by skipping the duplicates)

However, the first version uses a (relatively) negligible amount of additional memory, while the other versions use quite a lot of it (for the data I was using, almost 2GB of additional memory).
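
For reference, here is a minimal sketch of how such a comparison could be timed; corpus.txt and the short search_list are placeholders, not the actual data behind the numbers above.

import time
import ahocorasick

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

text = open("corpus.txt", encoding="utf-8").read()   # placeholder corpus
search_list = ["joão", "maria", "thiago"]             # placeholder terms

def with_ahocorasick():
    # build the automaton and collect the distinct terms found in text
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()
    return {s for _, (_, s) in A.iter(text)}

def with_word_set():
    # split the text into words once, then test each search term against the set
    words = set(text.split())
    return [w for w in search_list if w in words]

time_it("ahocorasick", with_ahocorasick)
time_it("word set", with_word_set)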


2 Comments

Wow, I did not know that a simple for + in would be faster than Aho-Corasick... Could you test whether using a list comprehension affects the performance?
The main trick here is that the list of words (returned from split()) has been converted to a set, and looking up an object in a set is O(1), independently of how many words are in there. I don't know Aho-Corasick in detail, but I suspect it checks general substrings, which is why it can be slower. What do you mean by list comprehension? To replace the for loop? If so, it won't change anything: the list/set the loop above iterates over is "tiny" (a few hundred words), so it doesn't affect the runtime at all. Well over 99% of the runtime is spent applying split() and converting the resulting list to a set.
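
To make the set-lookup point concrete, a small self-contained sketch (the synthetic word list and its size are arbitrary): membership tests against a set take roughly constant time, while the same test against a list scans it linearly.

import timeit

words_list = [f"w{i}" for i in range(1_000_000)]  # arbitrary synthetic words
words_set = set(words_list)

# membership test against the set is O(1) on average...
print(timeit.timeit(lambda: "w999999" in words_set, number=1000))
# ...while the same test against the list scans it linearly, so it is far slower
print(timeit.timeit(lambda: "w999999" in words_list, number=1000))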
