Similarity measure for Strings in Python

Question

I want to measure the similarity between two words. The idea is to read a text with OCR and check the result for keywords. The function I'm looking for should compare two words and return the similarity in %, so comparing a word with itself should give 100%. I wrote my own function that compares the words character by character and returns the number of matches relative to the length. But the problem is that

wordComp('h0t','hot')
0.66
wordComp('tackoverflow','stackoverflow')
0

But intuitively, both examples should have a very high similarity (>90%). Adding the Levenshtein distance

import nltk
nltk.edit_distance('word1','word2')

to my function increases the second result to 92%, but the first result is still not good.
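The original function is not shown, so the following is only a sketch of one common way to turn an edit distance into a percentage: divide the distance by the length of the longer word and subtract from 1. The helper name `edit_similarity` and the pure-Python `levenshtein` (used here instead of `nltk.edit_distance` to stay dependency-free) are assumptions for illustration. Under that assumption the numbers line up with the ones reported above.

```python
def levenshtein(a, b):
    """Edit distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Hypothetical helper: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


print(round(edit_similarity('tackoverflow', 'stackoverflow'), 2))  # 0.92
print(round(edit_similarity('h0t', 'hot'), 2))                     # 0.67
```

With this normalisation a single wrong character costs 1/3 of the score in a three-letter word but only 1/13 in a thirteen-letter word, which is why short words score so low.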

I already found this solution for "R", and it would be possible to use these functions via rpy2, or to use agrepy as another approach. But I want to be able to make the program more or less sensitive by changing the acceptance threshold (only accept matches with similarity > x%).

Is there another good measure I could use or do you have any ideas to improve my function?

ragamuffin · Accepted Answer · 2018-11-29 12:18:59Z

9

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('tackoverflow','stackoverflow'))
print(similar('h0t','hot'))

0.96
0.666666666667
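For context, the difflib documentation defines `ratio()` as 2*M/T, where M is the number of matching characters and T is the combined length of both strings. The sketch below recomputes that by hand from `get_matching_blocks()`; the function name `ratio_by_hand` is made up for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # difflib also reports 1.0 for two empty sequences
    matcher = SequenceMatcher(None, a, b)
    # Sum the sizes of all matching blocks (the final dummy block has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return 2 * matches / total


print(ratio_by_hand('tackoverflow', 'stackoverflow'))  # 2 * 12 / 25
print(ratio_by_hand('h0t', 'hot'))                     # 2 * 2 / 6
```

For 'h0t' and 'hot' only 'h' and 't' match, so M is 2 and T is 6, which gives the two-thirds score shown above.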

You could easily extend the function or wrap it in another function to account for different degrees of similarity, like so, passing a threshold as a third argument:

from difflib import SequenceMatcher

def similar(a, b, c):
    # c is the minimum ratio (between 0 and 1) that a pair must exceed
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > c:
        return sim
    return None  # below the threshold

print(similar('tackoverflow','stackoverflow', 0.9))
print(similar('h0t','hot', 0.9))

0.96
None
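For the keyword-checking use case in the question, difflib also ships `get_close_matches`, which applies a cutoff to a whole list of candidates at once. This is only a sketch: the keyword list, the function name `find_keywords`, and the lower-casing step are assumptions, and note that the cutoff here is inclusive (`>=`) while the function above uses a strict `>`.

```python
from difflib import get_close_matches

KEYWORDS = ['stackoverflow', 'hot']  # hypothetical keyword list


def find_keywords(ocr_words, keywords=KEYWORDS, threshold=0.9):
    """Map each OCR token to its best keyword match at or above threshold."""
    hits = {}
    for word in ocr_words:
        # n=1 keeps only the best-scoring keyword for this token
        best = get_close_matches(word.lower(), keywords, n=1, cutoff=threshold)
        if best:
            hits[word] = best[0]
    return hits


print(find_keywords(['tackoverflow', 'h0t', 'banana']))
# {'tackoverflow': 'stackoverflow'}
print(find_keywords(['tackoverflow', 'h0t', 'banana'], threshold=0.6))
# {'tackoverflow': 'stackoverflow', 'h0t': 'hot'}
```

Raising or lowering `threshold` makes the check more or less sensitive without touching the comparison itself.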

edited Nov 29, 2018 at 12:18

answered Nov 29, 2018 at 12:04

ragamuffin


7 Comments

tifi90 Over a year ago

Thank you for the idea. This helps me with the first problem, but the problem with short words is still unanswered. Any other ideas on that?

ragamuffin Over a year ago

I'm not quite sure why you want a higher value for the three-letter word. You say that intuitively you expected a higher similarity. Strictly speaking, one out of three characters differs between the strings, which makes them 66% similar. Can you elaborate on what your expected outcome should be and why?

tifi90 Over a year ago

I don't know what the exact outcome should be. What makes me think of a higher score is that if you compare h0t and hxt, then intuitively h0t is closer to hot than hxt is, since 0 and o are nearly the same. Just imagine this were handwritten: you wouldn't really mark h0t as wrong, but hxt clearly is.

ragamuffin Over a year ago

Well yeah, they are aesthetically similar, but I don't know of any way to test for that. That is quite subjective as well, isn't it? For all intents and purposes x, o and 0 are equally dissimilar to one another.

tifi90 Over a year ago

I just thought about the following "quick and dirty" fix: just map digits to chars with a fixed mapping (0->o, 5->s, 3->E, 9->g, ...). Since I'm searching for real words, a zero or five or whatever number should never be part of the keyword.
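A minimal sketch of that mapping idea, combined with the `SequenceMatcher` ratio from the answer. The names `LOOKALIKES`, `normalize` and `similar_ocr` are hypothetical, the table only covers the four pairs mentioned in the comment, and lower-casing everything (so 3 maps to e rather than E) is an assumption.

```python
from difflib import SequenceMatcher

# Look-alike digits mapped to letters; extend the table as needed.
LOOKALIKES = str.maketrans({'0': 'o', '5': 's', '3': 'e', '9': 'g'})


def normalize(word):
    """Lower-case the word and replace look-alike digits with letters."""
    return word.lower().translate(LOOKALIKES)


def similar_ocr(a, b):
    """Similarity ratio computed on the normalised forms of both words."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(similar_ocr('h0t', 'hot'))  # 1.0, because '0' is read as 'o'
print(similar_ocr('hxt', 'hot'))  # still about 0.67, 'x' is not mapped
```

This only works if the keywords themselves never contain digits, as stated in the comment above.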


MH.AI.eAgLe · Accepted Answer · 2018-12-01 10:20:18Z

0

I wrote the following code; try it. I defined str3 for the cases where the two strings being compared (str1 and str2) differ in length: it holds the longer string cut down to the length of the shorter one. The code runs in a while loop; enter -1 as the third input (k) to exit.

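k = 1
while k != -1:
    str1 = input()
    str2 = input()
    k = int(input())  # enter -1 to stop after this comparison
    cnt = 0  # reset the match counter for each new pair

    if len(str1) > len(str2):
        # cut the longer string down to the length of the shorter one
        str3 = str1[0:len(str2)]
        for j in range(len(str3)):
            if str3[j] == str2[j]:
                cnt += 1
        print(cnt / len(str1) * 100)

    elif len(str1) < len(str2):
        str3 = str2[0:len(str1)]
        # loop over the truncated length; len(str2) would raise an IndexError
        for j in range(len(str3)):
            if str3[j] == str1[j]:
                cnt += 1
        print(cnt / len(str2) * 100)

    else:
        for j in range(len(str2)):
            if str2[j] == str1[j]:
                cnt += 1
        print(cnt / len(str1) * 100)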

edited Dec 1, 2018 at 10:20

answered Nov 29, 2018 at 11:28

MH.AI.eAgLe


1 Comment

tifi90 Over a year ago

Thanks for sharing your code. This looks like what I tried in the first place, and this function gives results similar to mine. The main problem I see is that you lose a lot of information when you cut the string with str3=str2[0:len(str1)].
