Custom Metric
We can write a crude fuzzy-match metric. You could likely improve it by removing high-frequency words and stemming appropriately.
```python
import numpy as np

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # pairwise equality: c[i, j] is True when a[i] == b[j]
    c = a[:, None] == b[None, :]
    # fraction of matched words in each direction; take the smaller
    return min(c.max(0).mean(), c.max(1).mean())
```
This computes, for each of the two word lists, the fraction of its words that match a word in the other list, and returns the smaller of the two fractions.
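For instance, comparing the word lists of 'cat' and 'bad cat' (the function is restated so the snippet runs on its own):

```python
import numpy as np

def fuzz(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # pairwise equality: c[i, j] is True when a[i] == b[j]
    c = a[:, None] == b[None, :]
    # fraction of matched words in each direction; take the smaller
    return min(c.max(0).mean(), c.max(1).mean())

fuzz('cat'.split(), 'bad cat'.split())  # → 0.5, half of 'bad cat' matched
```

Because the minimum of both directions is taken, extra unmatched words in either list pull the score down.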
We build a DataFrame of pairwise scores to illustrate.
```python
import pandas as pd

d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
    for a in df.column.str.split()
], df.index, lst)

d
```
```
   good dog  bad cat
0       0.0      1.0
1       1.0      0.0
2       0.0      0.5
3       0.5      0.0
```
We get a metric of 1.0 for the first row against 'bad cat' and for the second row against 'good dog'. The third and fourth rows score 0.5, meaning half the words matched.
Now set a threshold and keep the rows where any score meets it.
For a threshold of .5:
```python
df[d.ge(.5).any(axis=1)]
```
```
     column
0   bad cat
1  good dog
2       cat
3       dog
```
For a threshold of .6:
```python
df[d.ge(.6).any(axis=1)]
```
```
     column
0   bad cat
1  good dog
```
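Putting it together, here is a self-contained sketch; the sample `df` and `lst` are assumptions reconstructed to be consistent with the outputs shown above:

```python
import numpy as np
import pandas as pd

def fuzz(a, b):
    # minimum, over both directions, of the fraction of words matched
    a = np.asarray(a)
    b = np.asarray(b)
    c = a[:, None] == b[None, :]
    return min(c.max(0).mean(), c.max(1).mean())

# assumed sample inputs, chosen to reproduce the tables above
df = pd.DataFrame({'column': ['bad cat', 'good dog', 'cat', 'dog']})
lst = ['good dog', 'bad cat']

d = pd.DataFrame([
    [fuzz(a, b) for b in map(str.split, lst)]
    for a in df.column.str.split()
], df.index, lst)

print(df[d.ge(.5).any(axis=1)])  # all four rows pass
print(df[d.ge(.6).any(axis=1)])  # only the exact matches pass
```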
Levenshtein
Use the similarity ratio from the python-Levenshtein package, which scores whole strings character by character rather than word by word:
```python
import Levenshtein

c = pd.DataFrame([
    [Levenshtein.ratio(a, b) for b in lst]
    for a in df.column
], df.index, lst)

c
```
```
   good dog   bad cat
0  0.266667  1.000000
1  1.000000  0.266667
2  0.000000  0.600000
3  0.545455  0.200000
```
And you can do the same threshold analysis as above.
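If the python-Levenshtein package isn't installed, the standard library's `difflib.SequenceMatcher` provides a comparable similarity ratio. A sketch using the same assumed sample data as above:

```python
import difflib
import pandas as pd

# assumed sample inputs, consistent with the outputs shown earlier
df = pd.DataFrame({'column': ['bad cat', 'good dog', 'cat', 'dog']})
lst = ['good dog', 'bad cat']

# SequenceMatcher.ratio returns 2*M/T, where M is the number of matched
# characters and T the total length of both strings
c = pd.DataFrame([
    [difflib.SequenceMatcher(None, a, b).ratio() for b in lst]
    for a in df.column
], df.index, lst)

df[c.ge(.5).any(axis=1)]
```

The exact numbers differ slightly from `Levenshtein.ratio`, but the thresholding works the same way.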