Fastest way to check if all elements of a list of strings are in a string

Question

I have a string

"My name is Andrew, I am pretty awesome".

Let's say I have a list of lists such as

[['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]

I need my solution to return

['andrew', 'name', 'awesome']

The naive solution is :

myString = 'My name is Andrew, I am pretty awesome'
keywords = [['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]
results = []
for i in keywords:
    # keep the sublist only if every keyword occurs in the lowercased string
    if all(substring in myString.lower() for substring in i):
        results.append(i)
print(results)

My issue is that when the list keywords is very large (say, 100,000 sublists), this becomes a performance bottleneck. What is the most efficient way to do this?

Create a set with the words you want to check; basically, use myString = set('My name is Andrew, I am pretty awesome'.split())
– BlackBear, Commented Jan 16, 2018 at 9:36
There is a typo on the second to last line. It should be results with an s. And btw you are returning a list of lists: [['andrew', 'name', 'awesome']]
– Ma0, Commented Jan 16, 2018 at 9:38
And you probably don't want that print statement inside the for loop.
– PM 2Ring, Commented Jan 16, 2018 at 9:39

cs95 · Accepted Answer · 2018-01-16 10:22:46Z

Thanks to BlackBear for pointing out that my timings were skewed by re-computing loop invariants inside the loop. With those moved out, things change drastically.

There are two ways of doing this. The sane way, and the regex way. First, the setup.

string = "My name is Andrew, I am pretty awesome"
choices = [['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]

Option 1
This one performs an in substring check inside a list comprehension. The in check runs on a modified implementation of the Boyer-Moore algorithm in C, and is very fast.

>>> [c for c in choices if all(y in string.lower() for y in c)]
[['andrew', 'name', 'awesome']]

And now, for the timings. But first, a minor performance nitpick: string.lower() is a loop invariant, so you can compute it once outside the loop instead of re-computing it on every iteration.

v = string.lower()
%timeit [c for c in choices if all(y in v for y in c)]
1000000 loops, best of 3: 2.05 µs per loop
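
One thing worth keeping in mind with this option: in on strings tests for substrings, not whole words, so a keyword can match inside a longer word. A small sketch of the difference (the text value and the 'and' keyword are made up for illustration):

```python
import re

# Illustrative input, already lowercased.
text = "my name is andrew, i am pretty awesome"

# Substring check: 'and' occurs inside 'andrew', so this is True.
substring_hit = "and" in text

# Whole-word check: split on runs of non-word characters, then test membership.
words = set(re.split(r"\W+", text))
word_hit = "and" in words

print(substring_hit, word_hit)  # True False
```

If partial matches like this are unwanted, the set-based approach is the safer choice regardless of timing.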

Option 2
This one uses re.split + set.issuperset:

>>> import re
>>> [c for c in choices if set(re.split(r'\W', string.lower())).issuperset(c)]
[['andrew', 'name', 'awesome']]

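A plain str.split() is not enough for set checks here because of the punctuation in your sentences: it would leave 'andrew,' (with the comma) as a token, which does not match 'andrew'. Splitting on non-word characters with re.split takes care of that.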
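
To see why a plain str.split() falls short for set checks, here is a minimal sketch using the string from the setup (the names plain_tokens and regex_tokens are only for illustration):

```python
import re

string = "My name is Andrew, I am pretty awesome"

# str.split() only splits on whitespace, so the comma stays glued to 'andrew'.
plain_tokens = set(string.lower().split())

# Splitting on single non-word characters drops the punctuation. The ', '
# sequence yields one empty string, which is harmless for superset checks.
regex_tokens = set(re.split(r'\W', string.lower()))

print('andrew' in plain_tokens)   # False
print('andrew' in regex_tokens)   # True
```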

Again, the set computation is a loop invariant and can be moved out. This is how it performs:

v = set(re.split(r'\W', string.lower()))
%timeit [c for c in choices if v.issuperset(c)]
1000000 loops, best of 3: 1.13 µs per loop

This is an exceptional case where I find regular expressions performing marginally faster. However, these timings are not conclusive, because they vary widely with the data's size and structure. I'd recommend trying things out with your own data before drawing any conclusions, although my gut feeling is that the regex solution would scale poorly.
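
For trying this out at a larger scale, a rough benchmarking sketch with timeit might look like the following. The vocabulary, the 100,000 randomly generated keyword lists, and the function names are all assumptions made up for illustration; real data may behave differently, so treat any numbers it prints as indicative only.

```python
import random
import re
import timeit

string = "My name is Andrew, I am pretty awesome"

# Hypothetical vocabulary: four words occur in the string, two do not.
vocabulary = ['andrew', 'name', 'awesome', 'designation', 'pretty', 'salary']

# Hypothetical workload: 100,000 keyword lists of three distinct words each.
random.seed(0)
choices = [random.sample(vocabulary, 3) for _ in range(100000)]

lowered = string.lower()                  # loop invariant for Option 1
tokens = set(re.split(r'\W', lowered))    # loop invariant for Option 2

def substring_filter():
    # Option 1: substring checks against the lowercased string.
    return [c for c in choices if all(y in lowered for y in c)]

def set_filter():
    # Option 2: superset check against the precomputed token set.
    return [c for c in choices if tokens.issuperset(c)]

# Both options agree on this data because every keyword is a whole word.
assert substring_filter() == set_filter()

print('substring:', timeit.timeit(substring_filter, number=10))
print('set:      ', timeit.timeit(set_filter, number=10))
```

Both invariants are computed once up front, so the comparison measures only the per-list check.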

Actually regexes are faster (on my machine, at least), but your code is slower because you build the set every time
