I have a string

"My name is Andrew, I am pretty awesome".

Let's say I have a list of lists such as

[['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]

I need my solution to return

['andrew', 'name', 'awesome']

The naive solution is:

myString = 'My name is Andrew, I am pretty awesome'
keywords = [['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]
results = []
for i in keywords:
    # keep a keyword list only if every keyword occurs in the lowercased string
    if all(substring in myString.lower() for substring in i):
        results.append(i)
print(results)

My issue is that when the keywords list is very large (say, 100000 entries), this becomes a performance bottleneck. What is the most efficient way to do this?

  • Create a set with the words you want to check, basically use myString = set('My name is Andrew, I am pretty awesome'.split()) Commented Jan 16, 2018 at 9:36
  • Are you sure this code even works? Commented Jan 16, 2018 at 9:38
  • There is a typo on the second to last line. It should be results with an s. And btw you are returning a list of lists [['andrew', 'name', 'awesome']] Commented Jan 16, 2018 at 9:38
  • And you probably don't want that print statement inside the for loop. Commented Jan 16, 2018 at 9:39
  • And lowercase myString in the all function. Commented Jan 16, 2018 at 9:40

1 Answer

Thanks to BlackBear for pointing out that my timings were skewed by re-computing loop invariants inside the loop. Once those are moved out, the results change drastically.

There are two ways of doing this. The sane way, and the regex way. First, the setup.

string = "My name is Andrew, I am pretty awesome"
choices = [['andrew', 'name', 'awesome'], ['andrew', 'designation', 'awesome']]

Option 1
This one performs a substring check with the in operator inside a list comprehension. In CPython, the in check is implemented in C using a variant of the Boyer-Moore algorithm, and is very fast.

>>> [c for c in choices if all(y in string.lower() for y in c)]
[['andrew', 'name', 'awesome']]

And now, for the timings. But first, a minor performance nitpick: you can cache the value of string.lower() outside the loop. It is a loop invariant and doesn't need to be re-computed on each iteration:

v = string.lower()
%timeit [c for c in choices if all(y in v for y in c)]
1000000 loops, best of 3: 2.05 µs per loop
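One thing to keep in mind with substring checks is that they also match inside longer words. A minimal sketch, using a made-up sentence that is not part of the question's data:

```python
# Hypothetical sentence, chosen only to illustrate partial-word matches.
text = "Please rename the file".lower()

# 'name' is found because it occurs inside 'rename', not as a separate word.
print('name' in text)          # True

# Checking against the list of whitespace-separated words does not match.
print('name' in text.split())  # False
```

Whether that behaviour is wanted depends on the data, so it is worth checking before choosing between the two options.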

Option 2
This one uses re.split + set.issuperset:

>>> import re
>>> [c for c in choices if set(re.split(r'\W', string.lower())).issuperset(c)]
[['andrew', 'name', 'awesome']]

A plain str.split() is not enough for set checks here, because punctuation stays attached to the words (for example, 'andrew,'), so re.split is used to split on non-word characters instead.
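As a minimal sketch of the punctuation problem, the snippet below compares whitespace splitting with re.split on the example sentence. The names plain_tokens and regex_tokens are illustrative only.

```python
import re

string = "My name is Andrew, I am pretty awesome"

# Whitespace splitting keeps the trailing comma attached to "andrew".
plain_tokens = set(string.lower().split())

# Splitting on non-word characters drops the punctuation. The comma followed
# by a space also produces an empty string, which is harmless for lookups.
regex_tokens = set(re.split(r'\W', string.lower()))

print('andrew' in plain_tokens)   # False
print('andrew,' in plain_tokens)  # True
print('andrew' in regex_tokens)   # True
```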

Again, the set computation is a loop invariant and can be moved out. This is how it performs:

v = set(re.split(r'\W', string.lower()))
%timeit [c for c in choices if v.issuperset(c)]
1000000 loops, best of 3: 1.13 µs per loop

This is an exceptional case where I find the regular expression approach performing marginally faster. However, these timings are not conclusive, because they vary greatly with the size and structure of the data. I'd recommend trying both options on your own data before drawing any conclusions, although my gut feeling is that the regex solution would scale poorly.
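If you want to try this at a larger scale, here is a rough benchmark sketch using timeit. Everything in it is an assumption made for illustration: the synthetic vocabulary, the sizes, and the helper names substring_filter and word_set_filter. Real data will behave differently.

```python
import random
import re
import timeit

random.seed(0)  # make the synthetic data reproducible

# Hypothetical vocabulary of equal-length tokens, so no token is a substring
# of another and both approaches select the same keyword lists.
vocab = ["w%05d" % i for i in range(1000)]
text = " ".join(random.sample(vocab, 500))
choices = [random.sample(vocab, 3) for _ in range(100000)]

def substring_filter(text, choices):
    lowered = text.lower()  # loop invariant, computed once
    return [c for c in choices if all(word in lowered for word in c)]

def word_set_filter(text, choices):
    words = set(re.split(r'\W', text.lower()))  # loop invariant, computed once
    return [c for c in choices if words.issuperset(c)]

# Time three full passes over the 100000 keyword lists for each approach.
substring_time = timeit.timeit(lambda: substring_filter(text, choices), number=3)
word_set_time = timeit.timeit(lambda: word_set_filter(text, choices), number=3)
print("substring: %.3f s, word set: %.3f s" % (substring_time, word_set_time))
```

The numbers depend on the machine and on how long the text and keyword lists are, so treat them as a starting point rather than a verdict.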

1 Comment

Actually regexes are faster (on my machine, at least), but your code is slower because you build the set every time
