Out of interest I ran timings on the techniques in @mozway's answer and the answers proposed here, using some fairly big data (100k rows in the dataframe, a 1000-word reference string). To summarise:
```python
# isin (works for whole words only; could have issues with punctuation
# in the reference string)
out = df[df['col2'].isin(reference_str.split())]

# list comprehension
out = df[[x in reference_str for x in df['col2']]]

# apply with str.find
# (slightly modified: .ge(0) is faster than .ne(-1))
out = df[df['col2'].apply(reference_str.find).ge(0)]

# apply with str.__contains__
out = df[df['col2'].apply(reference_str.__contains__)]

# apply with `in`
out = df[df['col2'].apply(lambda x: x in reference_str)]
```
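As a side note on the punctuation caveat flagged for `isin`: one hedge (not benchmarked here, and the `str.translate` approach is just one option) is to strip punctuation from the reference string before splitting. A minimal sketch with made-up example data:

```python
import string
import pandas as pd

df = pd.DataFrame({'col2': ['apple', 'pear', 'kiwi']})
reference_str = 'apple, pear; banana.'  # punctuation would break a plain split

# build a clean word set: delete punctuation, then split on whitespace
ref_words = set(reference_str.translate(
    str.maketrans('', '', string.punctuation)).split())

out = df[df['col2'].isin(ref_words)]  # matches 'apple' and 'pear'
```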
I used words from the NLTK corpus for the test data. This is the test script (using isin as the example):
```python
import timeit

timeit.timeit(setup='''
import pandas as pd
import random
from nltk.corpus import words
N = 1000
wordlist = words.words()
strs = random.choices(wordlist, k=N*100)
df = pd.DataFrame({'col1': range(N*100), 'col2': strs})
reference_str = ' '.join(random.choices(wordlist, k=N))
''', stmt='''
out = df[df['col2'].isin(reference_str.split())]
''', number=1)
```
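To time all the variants without copy-pasting the setup, the statements can be looped over. This is a sketch of how the numbers in the table below could be gathered in one go, not the exact script used; it assumes the NLTK words corpus has been downloaded (`nltk.download('words')`):

```python
import timeit

setup = '''
import pandas as pd
import random
from nltk.corpus import words
N = 1000
wordlist = words.words()
strs = random.choices(wordlist, k=N*100)
df = pd.DataFrame({'col1': range(N*100), 'col2': strs})
reference_str = ' '.join(random.choices(wordlist, k=N))
'''

# one statement per technique, exactly as listed above
stmts = {
    'isin': "out = df[df['col2'].isin(reference_str.split())]",
    'find': "out = df[df['col2'].apply(reference_str.find).ge(0)]",
    'list comp': "out = df[[x in reference_str for x in df['col2']]]",
    'contains': "out = df[df['col2'].apply(reference_str.__contains__)]",
    'in': "out = df[df['col2'].apply(lambda x: x in reference_str)]",
}

for name, stmt in stmts.items():
    print(f'{name}: {timeit.timeit(setup=setup, stmt=stmt, number=1):.3f}s')
```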
For matching whole words, the isin technique is far and away the fastest: roughly 70-80x the speed of the other variants.
| Variant   | Time (s) | Ratio vs isin | Ratio vs find |
|-----------|----------|---------------|---------------|
| isin      | 0.013    | 1             | 0.014         |
| find      | 0.917    | 70.5          | 1             |
| list comp | 1.050    | 80.1          | 1.14          |
| contains  | 1.054    | 81.1          | 1.15          |
| in        | 1.066    | 82.0          | 1.16          |
However, should a substring search be necessary, `find` is the most efficient, about 15% faster than the other substring techniques.
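To make the whole-word vs substring distinction concrete, here is a small toy example (the data is invented for illustration, not taken from the benchmark):

```python
import pandas as pd

df = pd.DataFrame({'col2': ['cat', 'catalog', 'dog']})
reference_str = 'catalog of dogs'

# whole-word match: only 'catalog' appears as a word in the reference string
whole = df[df['col2'].isin(reference_str.split())]

# substring match: 'cat', 'catalog' and 'dog' all occur inside the string
sub = df[df['col2'].apply(reference_str.find).ge(0)]
```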