Find lots of string in text - Python

Question

I'm searching for the best algorithm to resolve this problem: having a list (or a dict, a set) of small sentences, find the all occurrences of this sentences in a bigger text. The sentences in the list (or dict, or set) are about 600k but formed, on average, by 3 words. The text is, on average, 25 words long. I've just formatted the text (deleting punctuation, all lowercase and go on like this).

Here is what I have tried out (Python):

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
    .....
]

text = 'i love android and i think i will have a tea with john'

def find_sentence(to_find_sentences, text):
    text = text.split()
    res = []
    w = len(text)
    for i in range(w):
        for j in range(i+1,w+1):
            tmp = ' '.join(descr[i:j])
            if tmp in to_find_sentences:
                res.add(tmp)
    return res


print find_sentence(to_find_sentence, text)

Out:

['i love android', 'have a tea']

In my case I've used a set to speed up the in operation

It's too broad a question but you may try organize the many many small query strings into a prefix tree (or something else, depending on the characteristics of the query strings). In this way the code can be smarter skipping impossible queries and testing/refining partial matches. — Cong Ma
– Cong Ma, Commented Apr 26, 2017 at 8:42

Community · Accepted Answer · 2017-05-23 12:02:42Z

6

A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:

(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

Here's an example on debuggex:

It might be a good idea to add '\b' as word boundaries, to avoid matching "have a team".

You'll need a small Trie script. It's not an official package yet, but you can simply download it here as trie.py in your current directory.

You can then use this code to generate the trie/regex:

import re
from trie import Trie

to_find_sentences = [
    'bla bla',
    'have a tea',
    'hy i m luca',
    'i love android',
    'i love ios',
]

trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))

pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'

print(re.findall(pattern, text))
# ['i love android', 'have a tea']

You invest some time to create the Trie and the regex, but the processing should be extremely fast.

Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.

Note that it wouldn't find overlapping sentences:

to_find_sentences = [
    'i love android',
    'android Marshmallow'
]
# ...
print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']

You'd have to modifiy the regex with positive lookaheads to find overlapping sentences.

edited May 23, 2017 at 12:02

CommunityBot

11 silver badge

answered Apr 26, 2017 at 8:46

Eric Duminil

54.6k10 gold badges80 silver badges134 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Luca Di Liello Over a year ago

Thank you for this answer. I'm using python2, and when I run it says that the constructor Trie() need at least 2 argument. To install it I've used pip install trie. I think this is the wrong library because if I give the list to the constructor, it says that trie.pattern() does not exist.

Eric Duminil Over a year ago

@LucaDiLiello: Sorry about that. My small script isn't an official package yet. I edited the answer.

Luca Di Liello Over a year ago

Hi, do you know about a C++ version of your code to speed up the process? With about 750k of words in the trie, I'm always going in OutOfMemory exception. So I'm trying to produce the regex's string in C++ and finally compiling it in Python.

Eric Duminil Over a year ago

@LucaDiLiello: Sorry, it's been more than 10 years I didn't write any C++. You might have more success with pygtrie. A Google lib shouldn't have any problem with 750k words.

Collectives™ on Stack Overflow

Find lots of string in text - Python

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related