
I am trying to filter a dataframe using the condition that the string values in a column are substrings of a string defined outside of the dataframe. Example below:

df = [['a', 'b', 'c'], ['hello', 'bye', 'hello']]

reference_str = "hello there"

output = ['a','c']

One way would be to iterate through every value in the column using regex. I was wondering if there is a more efficient way of doing it. Thanks in advance.
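
For concreteness, a minimal sketch of that regex-per-row idea (this assumes the example above is meant as a two-column frame; col1/col2 are placeholder column names, and the expected output is the col1 values of the matching rows):

import re
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['hello', 'bye', 'hello']})
reference_str = "hello there"

# Row-by-row regex search: keep rows whose col2 value occurs anywhere in reference_str
mask = [re.search(re.escape(x), reference_str) is not None for x in df['col2']]
output = df.loc[mask, 'col1'].tolist()  # ['a', 'c']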

  • @Nick contains wouldn't really work here as the opposite match is wanted ;) (the logic is still a bit unclear) Commented Oct 2, 2023 at 1:47
  • @randomwerg can you please provide a valid DataFrame constructor? Do you only have full matches for words? Commented Oct 2, 2023 at 1:49
  • @Nick I see, I read too quickly ;) I can't say I'm fond of the method though Commented Oct 2, 2023 at 1:50
  • I'd use the third one with a list comprehension instead of apply, not sure it's worth a separate answer. Commented Oct 2, 2023 at 2:00
  • Thank you Nick and Mozway for the helpful solutions! Apologies for the poor syntax. Commented Oct 2, 2023 at 16:45

2 Answers


If you want to match full words, you could use isin on the split string:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['hello', 'bye', 'hello']})

reference_str = "hello there"

out = df[df['col2'].isin(reference_str.split())]

print(out)

If you really want to match any substring (for instance 'the' or 'el' should match), then you have to loop:

out = df[[x in reference_str for x in df['col2']]]

Output:

  col1   col2
0    a  hello
2    c  hello

3 Comments

Do you know why df.query(z in @reference_str) does not work on this (after naming the column z)? I tried that thinking it would be cleaner syntax, but it doesn't return any rows for this example.
Doesn't work how? query takes a string as input: df.query('z in @reference_str')
Doesn't work in that df.query('col2 in @reference_str') returns empty on your sample data. However, if I change 'col2': [5, 1, 9] and define reference_tup = (5,7,9), then df2.query('col2 in @reference_tup') returns two rows, as expected.
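
For reference, a small sketch reproducing the behaviour described in the last comment (the integer column and reference_tup come from that comment; this only demonstrates the observed behaviour, not the internals of query):

import pandas as pd

df2 = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [5, 1, 9]})
reference_tup = (5, 7, 9)

# `in` against a list-like of values acts as a membership test and returns two rows here
print(df2.query('col2 in @reference_tup'))

# With a plain string on the right-hand side (as in the original question), the same
# pattern returns no rows, consistent with `in` being a membership test rather than
# an elementwise substring search.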

Out of interest, I ran timings on the techniques in @mozway's answer and the approaches proposed here, using some fairly big data (100k rows in the dataframe, a 1000-word reference string). To summarise:

# isin (works for whole words only; could have issues with punctuation in the
# reference string, see the sketch after this listing)
out = df[df['col2'].isin(reference_str.split())]

# list comprehension
out = df[[x in reference_str for x in df['col2']]]

# apply with str.find
# note slightly modified as .ge(0) is faster than .ne(-1)
out = df[df['col2'].apply(reference_str.find).ge(0)]

# apply with str.__contains__
out = df[df['col2'].apply(reference_str.__contains__)]

# apply with in
out = df[df['col2'].apply(lambda x: x in reference_str)]
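
As flagged in the first comment of the listing above, splitting the reference string on whitespace keeps punctuation attached to words, so e.g. "there," would not match 'there'. A minimal sketch of one workaround, using a hypothetical punctuated reference string and re.findall to tokenise before isin:

import re
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['hello', 'bye', 'hello']})
reference_str = "hello there, world!"  # hypothetical reference string with punctuation

# Tokenise on word characters instead of str.split so trailing punctuation
# ("there," / "world!") does not block a whole-word match
tokens = re.findall(r'\w+', reference_str)
out = df[df['col2'].isin(tokens)]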

I used the NLTK words corpus for test data. This is the test script (using isin for the example):

import timeit

timeit.timeit(setup='''
import pandas as pd
import random
from nltk.corpus import words

N = 1000
wordlist = words.words()
strs = random.choices(wordlist, k=N*100)
df = pd.DataFrame({ 'col1' : range(N*100), 'col2' : strs })
reference_str = ' '.join(random.choices(wordlist, k=N))
''', stmt='''
out = df[df['col2'].isin(reference_str.split())]
''', number=1)

For matching whole words, the isin technique is far and away the fastest, about 80x the speed of any of the other variants.

Variant      Time (s)   Ratio (isin)   Ratio (find)
isin         0.013      1              0.014
find         0.917      70.5           1
list comp    1.050      80.1           1.14
contains     1.054      81.1           1.15
in           1.066      82             1.16

However, should a substring search be necessary, find is the most efficient, about 15% faster than the other techniques.

