Custom Metric
We can write a crude fuzzy-match metric. You could likely improve it by removing high-frequency words and stemming appropriately.
```python
import numpy as np

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # pairwise equality: c[i, j] is True when a[i] == b[j]
    c = a[:, None] == b[None, :]
    # fraction of matched words in each direction; take the smaller
    return min(c.max(0).mean(), c.max(1).mean())
```
This computes, for each of the two word lists, the fraction of its words that match a word in the other list, and returns the smaller of the two fractions.
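For instance, comparing the word lists of 'cat' and 'bad cat' (the function is restated so the snippet runs on its own):

```python
import numpy as np

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # pairwise equality: c[i, j] is True when a[i] == b[j]
    c = a[:, None] == b[None, :]
    # fraction of matched words in each direction; take the smaller
    return min(c.max(0).mean(), c.max(1).mean())

fuzz('cat'.split(), 'bad cat'.split())  # → 0.5, half of 'bad cat' matched
```

Because the minimum of both directions is taken, extra unmatched words in either list pull the score down.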
We build a DataFrame of pairwise scores to illustrate.
```python
import pandas as pd

d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
    for a in df.column.str.split()
], df.index, lst)

d
```
```
   good dog  bad cat
0       0.0      1.0
1       1.0      0.0
2       0.0      0.5
3       0.5      0.0
```
We get a metric of 1.0 for the first row against 'bad cat' and for the second row against 'good dog'. The third and fourth rows score 0.5, meaning half the words matched.
Now set a threshold and keep the rows where any score meets it.
For a threshold of .5:
```python
df[d.ge(.5).any(axis=1)]
```
```
     column
0   bad cat
1  good dog
2       cat
3       dog
```
For a threshold of .6:
```python
df[d.ge(.6).any(axis=1)]
```
```
     column
0   bad cat
1  good dog
```
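Putting it together, here is a self-contained sketch; the sample `df` and `lst` are assumptions reconstructed to be consistent with the outputs shown above:

```python
import numpy as np
import pandas as pd

def fuzz(a, b):
    # minimum, over both directions, of the fraction of words matched
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]
    return min(c.max(0).mean(), c.max(1).mean())

# assumed sample inputs, chosen to reproduce the tables above
df = pd.DataFrame({'column': ['bad cat', 'good dog', 'cat', 'dog']})
lst = ['good dog', 'bad cat']

d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
    for a in df.column.str.split()
], df.index, lst)

print(df[d.ge(.5).any(axis=1)])  # all four rows pass
print(df[d.ge(.6).any(axis=1)])  # only the exact matches pass
```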
Levenshtein
Use the similarity ratio from the python-Levenshtein package, which scores whole strings character by character rather than word by word:
```python
import Levenshtein

c = pd.DataFrame([
    [Levenshtein.ratio(a, b) for b in lst]
    for a in df.column
], df.index, lst)

c
```
```
   good dog   bad cat
0  0.266667  1.000000
1  1.000000  0.266667
2  0.000000  0.600000
3  0.545455  0.200000
```
And you can do the same threshold analysis as above.
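If the python-Levenshtein package isn't installed, the standard library's `difflib.SequenceMatcher` provides a comparable similarity ratio. A sketch using the same assumed sample data as above:

```python
import difflib
import pandas as pd

# assumed sample inputs, consistent with the outputs shown earlier
df = pd.DataFrame({'column': ['bad cat', 'good dog', 'cat', 'dog']})
lst = ['good dog', 'bad cat']

# SequenceMatcher.ratio returns 2*M/T, where M is the number of matched
# characters and T the total length of both strings
c = pd.DataFrame([
    [difflib.SequenceMatcher(None, a, b).ratio() for b in lst]
    for a in df.column
], df.index, lst)

df[c.ge(.5).any(axis=1)]
```

The exact numbers differ slightly from `Levenshtein.ratio`, but the thresholding works the same way.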