I have the following data:

list = ['good dog','bad cat']

pattern = '|'.join(list)

|column|
|---|
|bad cat|
|good dog|
|cat|
|dog|

When I use str.contains in pandas, only the strings that match an entire pattern return True, as shown below:

df['column'].str.contains(pattern, regex=True)

|column|
|---|
|True|
|True|
|False|
|False|

Would it be possible to do something like a fuzzy match, where partial strings within a pattern are also checked? The output would then be all True, since "cat" and "dog" are each partially present in a pattern.

Thanks.

  • split the pattern list and join by |? Commented Jun 12, 2019 at 20:09
  • pattern = "|".join(["%("+x.replace(" ", "|")+")%" for x in list]) could work. Commented Jun 12, 2019 at 20:10
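
The first comment's idea can be sketched roughly as follows: split each phrase into its words and join all of the words into one alternation, so that any single word counts as a match. The names df, lst, and words are assumptions for illustration, and re.escape is added here only as a precaution against regex metacharacters; it is not part of the comment.

```python
import re
import pandas as pd

# Sample data reconstructed from the question (names are illustrative).
df = pd.DataFrame({"column": ["bad cat", "good dog", "cat", "dog"]})
lst = ["good dog", "bad cat"]

# Collect the unique words across all phrases.
words = sorted({w for phrase in lst for w in phrase.split()})

# Build one alternation; escape each word so it is matched literally.
pattern = "|".join(re.escape(w) for w in words)

# True wherever at least one word occurs anywhere in the cell.
mask = df["column"].str.contains(pattern, regex=True)
print(mask)
```

Note that this matches substrings, so a word like "cat" would also match "concatenate"; word boundaries (\b) could be added if that matters.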

1 Answer

Custom Metric

Write a crude fuzzy-match metric. You can probably refine it by removing high-frequency words and stemming the remaining ones.

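import numpy as np

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # c[i, j] is True when word i of a equals word j of b
    c = a[:, None] == b[None, :]
    # share of b's words found in a, share of a's words found in b; keep the smaller
    return min(c.max(0).mean(), c.max(1).mean())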

This computes, for each of the two word lists, the fraction of its words that also appear in the other list, and returns the smaller of the two fractions.
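
As a small worked example of the two fractions, here is the metric applied to one pair from the question, 'cat' against 'bad cat'. The variable names share_of_b_found and share_of_a_found are made up for illustration.

```python
import numpy as np

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]
    return min(c.max(0).mean(), c.max(1).mean())

a = ["cat"]          # words of the cell
b = ["bad", "cat"]   # words of the phrase

# Comparison matrix: one row per word of a, one column per word of b.
c = np.asarray(a)[:, None] == np.asarray(b)[None, :]

share_of_b_found = c.max(0).mean()  # 1 of the 2 phrase words appears in the cell
share_of_a_found = c.max(1).mean()  # the single cell word appears in the phrase
score = fuzz(a, b)                  # the smaller of the two shares

print(share_of_b_found, share_of_a_found, score)
```

Taking the minimum means a short cell that is fully contained in a longer phrase is still penalized for the phrase words it lacks.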

We build a dataframe to help illustrate.

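import pandas as pd

lst = ['good dog', 'bad cat']  # assumed: the question's list, renamed to avoid shadowing the built-in
# rows: entries of df.column, columns: phrases in lst, values: fuzz score
d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
                for a in df.column.str.split()
], index=df.index, columns=lst)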

d

   good dog  bad cat
0       0.0      1.0
1       1.0      0.0
2       0.0      0.5
3       0.5      0.0

The metric is 1.0 for the first row against 'bad cat' and for the second row against 'good dog'. For the third and fourth rows the best score is 0.5, meaning half of the phrase's words matched.

Now set a threshold and check whether any score in a row reaches it:

For a threshold of .5

df[d.ge(.5).any(axis=1)]

     column
0   bad cat
1  good dog
2       cat
3       dog

For a threshold of .6

df[d.ge(.6).any(axis=1)]

     column
0   bad cat
1  good dog
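
To get a boolean mask shaped like the str.contains output in the question, the steps above could be wrapped in one function. This is only a sketch: fuzzy_contains is a hypothetical helper name, and it assumes every cell is a non-empty string (empty strings or NaN would need separate handling).

```python
import numpy as np
import pandas as pd

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]
    return min(c.max(0).mean(), c.max(1).mean())

def fuzzy_contains(series, phrases, threshold=0.5):
    """Hypothetical helper: True where any phrase scores >= threshold."""
    split_phrases = [p.split() for p in phrases]
    # One row of scores per cell, one score per phrase.
    scores = [
        [fuzz(words, p) for p in split_phrases]
        for words in series.str.split()
    ]
    # Keep the best score per cell and compare it with the threshold.
    return pd.Series(
        [max(row) >= threshold for row in scores],
        index=series.index,
    )

df = pd.DataFrame({"column": ["bad cat", "good dog", "cat", "dog"]})
lst = ["good dog", "bad cat"]

loose = fuzzy_contains(df["column"], lst, threshold=0.5)
strict = fuzzy_contains(df["column"], lst, threshold=0.6)
print(df[loose])
print(df[strict])
```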

Levenshtein

Use the Levenshtein distance ratio, which compares the strings character by character rather than word by word:

import Levenshtein

# rows: entries of df.column, columns: phrases in lst, values: similarity ratio
c = pd.DataFrame([
    [Levenshtein.ratio(a, b) for b in lst]
    for a in df.column
], index=df.index, columns=lst)

c

   good dog   bad cat
0  0.266667  1.000000
1  1.000000  0.266667
2  0.000000  0.600000
3  0.545455  0.200000

And you can do the same threshold analysis as above.
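
If the Levenshtein package is not available, one rough stand-in is difflib.SequenceMatcher from the standard library. This is a different similarity measure, so its scores will not necessarily equal those in the table above; the sketch only illustrates how the same threshold step would apply. The df and lst definitions are assumed from the question, and ratio is a helper name chosen here.

```python
from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({"column": ["bad cat", "good dog", "cat", "dog"]})
lst = ["good dog", "bad cat"]

def ratio(a, b):
    # Character-level similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a, b).ratio()

# rows: cells of df["column"], columns: phrases in lst
s = pd.DataFrame(
    [[ratio(a, b) for b in lst] for a in df["column"]],
    index=df.index,
    columns=lst,
)

# Same threshold step as before: keep rows where any score reaches the cutoff.
loose = s.ge(0.5).any(axis=1)
strict = s.ge(0.7).any(axis=1)
print(df[loose])
print(df[strict])
```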
