How to check if a string is in a longer string in pandas DataFrame?

Question

I know it's straightforward to use df['col'].str.contains() to check whether a column's values contain a certain substring.

What if I want to do it the other way around: check whether the column's value is contained in a longer string? I searched but couldn't find an answer. I thought this should be easy; in pure Python we can simply write 'a' in 'abc'.

I tried df.isin, but it doesn't seem to be designed for this purpose.

Say I have a df that looks like this:

       col1      col2
0     'apple'    'one'
1     'orange'   'two'
2     'banana'   'three'

I want to filter this df to the rows where col1 is contained in the string 'appleorangefruits'; it should return the first two rows.

Can you create a minimal reproducible example? That would explain a lot.
– harpan, Commented Aug 15, 2019 at 15:46
Is the longer string you want to check against a constant, or does it vary from case to case?
– Kevin Troy, Commented Aug 15, 2019 at 15:48
@KevinTroy Thanks, Kevin. It varies. For example, I have a column called ID in the df, but the user gives me IDs in another, slightly longer format. I want to iterate over the ID list to find the matching rows.
– Ev3rlasting, Commented Aug 15, 2019 at 16:15

Little Bobby Tables · Accepted Answer · 2019-08-15 16:32:37Z

As apply is notoriously slow I thought I'd have a play with some other ideas.

If your "long_string" is relatively short and your DataFrame is massive, you could do something weird like this.

from random import choice

import pandas as pd

# Create a large DataFrame
df = pd.DataFrame(
    data={'test': [choice('abcdef') for _ in range(10_000_000)]}
)

long_string = 'abcdnmlopqrtuvqwertyuiop'

def get_all_substrings(input_string):
    # Every contiguous, non-empty substring input_string[i:j + 1]
    length = len(input_string)
    return [input_string[i:j + 1] for i in range(length) for j in range(i, length)]

sub_strings = get_all_substrings(long_string)

# True where the column value exactly equals one of the substrings
df.test.isin(sub_strings)

This ran in about 300 ms, versus 2.89 s for the apply(lambda a: a in 'longer string') approach from the other answers: roughly ten times quicker.

Note: I used the get_all_substrings function from "How To Get All The Contiguous Substrings Of A String In Python?"
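
As a small-scale check, the substring-plus-isin idea can be applied to the DataFrame from the question and compared with the apply version. This is only a sketch: it builds a set instead of a list (an assumption made here to drop duplicate substrings, not part of the original answer), and the variable names are illustrative.

```python
import pandas as pd


def get_all_substrings(input_string):
    """Return the set of all contiguous, non-empty substrings."""
    length = len(input_string)
    return {
        input_string[i:j]
        for i in range(length)
        for j in range(i + 1, length + 1)
    }


df = pd.DataFrame(
    {"col1": ["apple", "orange", "banana"], "col2": ["one", "two", "three"]}
)
long_string = "appleorangefruits"

# isin does exact matching, so a value matches only if it equals a substring
mask_isin = df["col1"].isin(get_all_substrings(long_string))

# Reference result using the plain Python `in` operator
mask_apply = df["col1"].apply(lambda value: value in long_string)

result = df[mask_isin]
print(result)
```

One difference to keep in mind: an empty string counts as contained under the `in` operator, but it is not in the set of non-empty substrings, so the two masks would disagree on empty values. The number of substrings also grows quadratically with the length of the long string, which is why the answer limits this trick to relatively short strings.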

Yifei H · Accepted Answer · 2019-08-15 15:49:43Z

4

You can call an apply on the column, i.e.:

df['your col'].apply(lambda a: a in 'longer string')
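
If the column may contain missing values, the lambda above raises a TypeError, because the `in` operator on a string needs a string on the left-hand side. A minimal sketch of one way to guard against that, using made-up sample data and treating missing values as non-matches (an assumption, not something the answer specifies):

```python
import pandas as pd

longer_string = "appleorangefruits"
col = pd.Series(["apple", None, "banana", "orange"], name="your col")

# The isinstance check short-circuits for None/NaN, so `in` only ever
# sees real strings; missing values are reported as False.
mask = col.apply(
    lambda value: isinstance(value, str) and value in longer_string
)
print(mask)
```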


harpan · Accepted Answer · 2019-08-15 16:41:20Z

3

You need:

longstring = 'appleorangefruits'
df.loc[df['col1'].apply(lambda x: x in longstring)]

Output:

    col1    col2
0   apple   one
1   orange  two


IWHKYB · Accepted Answer · 2019-08-15 15:49:41Z

2

If the string you are checking against is a constant, I believe you can achieve it by using DataFrame.apply:

df.apply(lambda row: row['mycol'] in 'mystring', axis=1)
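
When the longer string is not a constant, as the asker describes in the comments (a list of longer IDs supplied by the user), the same idea can be extended by testing each value against every candidate. The following is a rough sketch under that reading; the column name ID comes from the comment, while the sample values and the name longer_ids are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["A12", "B34", "C56"], "value": [1, 2, 3]})

# Hypothetical longer-format IDs supplied by the user
longer_ids = ["2019-A12-x", "C56_backup"]

# A row matches if its ID is a substring of at least one longer ID
mask = df["ID"].apply(
    lambda short: any(short in longer for longer in longer_ids)
)
matched = df[mask]
print(matched)
```

This does one Python-level check per (row, longer ID) pair, so it is likely to be slow for large inputs; it is meant to show the logic rather than to be fast.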


Karn Kumar · Accepted Answer · 2019-08-15 17:18:00Z

1

Try:

>>> df[df.col1.apply(lambda x: x in 'appleorangefruits')]
     col1 col2
0   apple  one
1  orange  two
