3

I know it's straightforward to use Series.str.contains() (for example, df['col1'].str.contains('app')) to check whether a column's values contain a certain substring.

What if I want to do it the other way around: check whether the column's value is contained in a longer string? I searched but couldn't find an answer. I thought this would be easy, since in pure Python we can simply write 'a' in 'abc'.

I tried df.isin, but it doesn't seem to be designed for this purpose.

Say I have a df that looks like this:

       col1      col2
0     'apple'    'one'
1     'orange'   'two'
2     'banana'   'three'

I want to query this df for the rows where col1 is contained in the string 'appleorangefruits'; it should return the first two rows.

4
  • 3
    Can you create a minimal reproducible example? That would explain a lot. Commented Aug 15, 2019 at 15:46
  • 1
    Is the longer string you want to check against a constant, or does it vary from case to case? Commented Aug 15, 2019 at 15:48
  • 1
    @harvpan thanks. added a simple example Commented Aug 15, 2019 at 16:07
  • 1
    @KevinTroy thanks Kevin. It varies. For example, I have a column called ID in the df, but the user provides me IDs in another, slightly longer format. I want to iterate over the ID list to find the matching rows. Commented Aug 15, 2019 at 16:15

5 Answers

4

As apply is notoriously slow I thought I'd have a play with some other ideas.

If your "long_string" is relatively short and your DataFrame is massive, you could do something weird like this.

from random import choice

import pandas as pd

# Create a large DataFrame of random single-character strings
df = pd.DataFrame(
    data={'test': [choice('abcdef') for _ in range(10_000_000)]}
)

long_string = 'abcdnmlopqrtuvqwertyuiop'

def get_all_substrings(input_string):
    # Every contiguous, non-empty substring of input_string
    length = len(input_string)
    return [input_string[i:j + 1] for i in range(length) for j in range(i, length)]

sub_strings = get_all_substrings(long_string)

# True where the column value is one of the substrings of long_string
df.test.isin(sub_strings)

In my test this ran in about 300 ms, versus 2.89 s for the apply(lambda a: a in 'longer string') approach from the other answers, so it is roughly ten times quicker.

Note: I took the get_all_substrings function from the question "How To Get All The Contiguous Substrings Of A String In Python?"
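As a small sanity check, here is a sketch that compares the isin approach with the apply approach on the question's three-row example. The use of a set and the variable names mask_isin and mask_apply are my own choices for illustration, not part of the answer above. Note that the number of substrings grows quadratically (at most n(n+1)/2 for a string of length n), so this only makes sense when the long string is short.

```python
import pandas as pd

def get_all_substrings(input_string):
    # Set of every contiguous, non-empty substring of input_string
    length = len(input_string)
    return {input_string[i:j] for i in range(length) for j in range(i + 1, length + 1)}

long_string = "appleorangefruits"
df = pd.DataFrame(
    {"col1": ["apple", "orange", "banana"], "col2": ["one", "two", "three"]}
)

substrings = get_all_substrings(long_string)

# Vectorised membership test against the precomputed substrings
mask_isin = df["col1"].isin(substrings)

# Element-wise Python `in` check, for comparison
mask_apply = df["col1"].apply(lambda value: value in long_string)

print(mask_isin.equals(mask_apply))
```

One difference to keep in mind: an empty string counts as contained in any string with Python's in operator, but it is not in the substring set built here, so the two masks can disagree on empty values.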

4

You can call apply on the column, e.g.:

df['your col'].apply(lambda a: a in 'longer string')

3

You need:

longstring = 'appleorangefruits'
df.loc[df['col1'].apply(lambda x: x in longstring)]

Output:

    col1    col2
0   apple   one
1   orange  two
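
For anyone who wants to reproduce this, here is a self-contained sketch that builds the DataFrame from the question first. The list-comprehension mask at the end is an alternative I'm adding for illustration; it should select the same rows as the apply version, but I haven't benchmarked it.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": ["apple", "orange", "banana"], "col2": ["one", "two", "three"]}
)
longstring = "appleorangefruits"

# Boolean mask via apply: True where col1 is a substring of longstring
result = df.loc[df["col1"].apply(lambda x: x in longstring)]

# Same mask built with a plain list comprehension (illustrative alternative)
mask = [value in longstring for value in df["col1"]]
result_listcomp = df.loc[mask]

print(result)
```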

2

If the string you are checking against is a constant, I believe you can achieve it by using DataFrame.apply:

df.apply(lambda row: row['mycol'] in 'mystring', axis=1)
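
If the longer string is not a constant (the asker's comment mentions a list of longer-format IDs supplied by a user), one possible sketch is to test each value against every candidate string. The column name ID comes from that comment; the sample IDs and the names long_ids and matched are hypothetical.

```python
import pandas as pd

# Hypothetical data: short IDs in the DataFrame, longer IDs from the user
df = pd.DataFrame({"ID": ["A12", "B34", "C56"], "value": [1, 2, 3]})
long_ids = ["XX-A12-2019", "YY-C56-2019"]

# True where the short ID appears inside at least one of the longer IDs
mask = df["ID"].apply(
    lambda short_id: any(short_id in long_id for long_id in long_ids)
)
matched = df.loc[mask]

print(matched)
```

This does len(df) * len(long_ids) substring checks, so it is only a reasonable choice when both are fairly small.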

1

Try this:

>>> df[df.col1.apply(lambda x: x in 'appleorangefruits')]
     col1 col2
0   apple  one
1  orange  two
