
I am trying to filter a dataframe using the condition that the string values in a column are substrings of a string defined outside of the dataframe. Example below:

df = [['a', 'b', 'c'], ['hello', 'bye', 'hello']]

reference_str = "hello there"

output = ['a','c']

One way would be to iterate through every value in the column using regex. I was wondering if there is a more efficient way of doing it. Thanks in advance.
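
For concreteness, a minimal sketch of that regex-per-row idea (this assumes the example above is meant as a two-column frame; col1/col2 are placeholder column names, and the expected output is the col1 values of the matching rows):

import re
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['hello', 'bye', 'hello']})
reference_str = "hello there"

# Row-by-row regex search: keep rows whose col2 value occurs anywhere in reference_str
mask = [re.search(re.escape(x), reference_str) is not None for x in df['col2']]
output = df.loc[mask, 'col1'].tolist()  # ['a', 'c']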

  • @Nick contains wouldn't really work here as the opposite match is wanted ;) (the logic is still a bit unclear) Commented Oct 2, 2023 at 1:47
  • @randomwerg can you please provide a valid DataFrame constructor? Do you only have full matches for words? Commented Oct 2, 2023 at 1:49
  • @Nick I see, I read too quickly ;) I can't say I'm fond of the method though Commented Oct 2, 2023 at 1:50
  • I'd use the third one with a list comprehension instead of apply, not sure it's worth a separate answer. Commented Oct 2, 2023 at 2:00
  • Thank you Nick and Mozway for the helpful solutions! Apologies for the poor syntax. Commented Oct 2, 2023 at 16:45

2 Answers


If you want to match full words, you could use isin on the split string:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['hello', 'bye', 'hello']})

reference_str = "hello there"

out = df[df['col2'].isin(reference_str.split())]

print(out)

If you really want to match any substring (for instance 'the' or 'el' should match), then you have to loop:

out = df[[x in reference_str for x in df['col2']]]

Output:

  col1   col2
0    a  hello
2    c  hello

3 Comments

Do you know why df.query(z in @reference_str) does not work on this (after naming the column z)? I tried that thinking it would be cleaner syntax, but it doesn't return any rows for this example.
Doesn't work how? query takes a string as input: df.query('z in @reference_str')
Doesn't work in that df.query('col2 in @reference_str') returns empty on your sample data. However, if I change 'col2': [5, 1, 9] and define reference_tup = (5,7,9), then df2.query('col2 in @reference_tup') returns two rows, as expected.
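
For reference, a small sketch reproducing the behaviour described in the last comment (the integer column and reference_tup come from that comment; this only demonstrates the observed behaviour, not the internals of query):

import pandas as pd

df2 = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [5, 1, 9]})
reference_tup = (5, 7, 9)

# `in` against a list-like of values acts as a membership test and returns two rows here
print(df2.query('col2 in @reference_tup'))

# With a plain string on the right-hand side (as in the original question), the same
# pattern returns no rows, consistent with `in` being a membership test rather than
# an elementwise substring search.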

Out of interest, I ran timings on the techniques in @mozway's answer and the approaches proposed here, using some fairly big data (100k rows in the dataframe, a 1000-word reference string). To summarise:

# isin (works for whole words only; could have issues with punctuation in the
# reference string, see the sketch after this listing)
out = df[df['col2'].isin(reference_str.split())]

# list comprehension
out = df[[x in reference_str for x in df['col2']]]

# apply with str.find
# note slightly modified as .ge(0) is faster than .ne(-1)
out = df[df['col2'].apply(reference_str.find).ge(0)]

# apply with str.__contains__
out = df[df['col2'].apply(reference_str.__contains__)]

# apply with in
out = df[df['col2'].apply(lambda x: x in reference_str)]
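
As flagged in the first comment of the listing above, splitting the reference string on whitespace keeps punctuation attached to words, so e.g. "there," would not match 'there'. A minimal sketch of one workaround, using a hypothetical punctuated reference string and re.findall to tokenise before isin:

import re
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['hello', 'bye', 'hello']})
reference_str = "hello there, world!"  # hypothetical reference string with punctuation

# Tokenise on word characters instead of str.split so trailing punctuation
# ("there," / "world!") does not block a whole-word match
tokens = re.findall(r'\w+', reference_str)
out = df[df['col2'].isin(tokens)]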

I used the NLTK words corpus for test data. This is the test script (using isin for the example):

import timeit

timeit.timeit(setup='''
import pandas as pd
import random
from nltk.corpus import words

N = 1000
wordlist = words.words()
strs = random.choices(wordlist, k=N*100)
df = pd.DataFrame({ 'col1' : range(N*100), 'col2' : strs })
reference_str = ' '.join(random.choices(wordlist, k=N))
''', stmt='''
out = df[df['col2'].isin(reference_str.split())]
''', number=1)

For matching whole words, the isin technique is far and away the fastest, about 80x the speed of any of the other variants.

Variant      Time (s)   Ratio (isin)   Ratio (find)
isin         0.013      1              0.014
find         0.917      70.5           1
list comp    1.050      80.1           1.14
contains     1.054      81.1           1.15
in           1.066      82             1.16

However, should a substring search be necessary, find is the most efficient, about 15% faster than the other techniques.

