I want to measure the similarity between two words. The idea is to read a text with OCR and check the result for keywords. The function I'm looking for should compare two words and return their similarity as a percentage, so comparing a word with itself should give 100%. I wrote my own function that compares the words character by character and returns the number of matches relative to the length. The problem is that it gives results like these:

wordComp('h0t','hot')
0.66
wordComp('tackoverflow','stackoverflow')
0

Intuitively, though, both examples should have a very high similarity (>90%). Adding the Levenshtein distance

import nltk
nltk.edit_distance('word1','word2')

to my function raises the second result to 92%, but the first result is still not good.
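
For reference, here is a minimal sketch of how an edit distance can be turned into a similarity score. The question does not show how the distance was combined with wordComp, so the normalization below (dividing by the length of the longer word) is an assumption, and `levenshtein` and `edit_similarity` are hypothetical names. The distance is computed in plain Python so the snippet runs without nltk.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(round(edit_similarity('tackoverflow', 'stackoverflow'), 2))  # 0.92
print(round(edit_similarity('h0t', 'hot'), 2))                     # 0.67
```

With this normalization a single wrong character in a three-letter word can never score above about 67%, which is why the short-word case stays low.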

I already found this solution for R, and it would be possible to use those functions through rpy2, or to use agrepy as another approach. But I want to be able to make the program more or less sensitive by changing the acceptance threshold (only accept matches with similarity > x%).

Is there another good measure I could use or do you have any ideas to improve my function?

2 Answers

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('tackoverflow','stackoverflow'))
print(similar('h0t','hot'))

0.96
0.666666666667

You could easily extend the function, or wrap it in another function, to account for different degrees of similarity, for example by passing a threshold as a third argument:

from difflib import SequenceMatcher

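def similar(a, b, c):
    # c is the acceptance threshold (0.0 to 1.0)
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > c:
        return sim
    # falls through and returns None when the words are not similar enough

print(similar('tackoverflow','stackoverflow', 0.9))
print(similar('h0t','hot', 0.9))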

0.96
None
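
If the goal is to check OCR output against a list of keywords, difflib's `get_close_matches` already takes such a threshold through its `cutoff` parameter. The following is only a sketch: the keyword list, the OCR words, and the helper name `match_keywords` are made up for illustration.

```python
from difflib import get_close_matches

# Hypothetical keyword list and OCR output, for illustration only.
keywords = ['stackoverflow', 'hot', 'python']
ocr_words = ['tackoverflow', 'h0t', 'pyth0n', 'banana']


def match_keywords(words, keywords, cutoff=0.9):
    """Map each OCR word to its best keyword at or above cutoff, else None."""
    result = {}
    for word in words:
        # n=1 keeps only the single best candidate
        best = get_close_matches(word, keywords, n=1, cutoff=cutoff)
        result[word] = best[0] if best else None
    return result


print(match_keywords(ocr_words, keywords, cutoff=0.9))
print(match_keywords(ocr_words, keywords, cutoff=0.6))
```

With a cutoff of 0.9 only 'tackoverflow' is accepted. Lowering it to 0.6 also accepts 'h0t' and 'pyth0n', while 'banana' is still rejected.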

7 Comments

Thank you for the idea. This helps me with the first problem, but the problem with short words is still unanswered. Any other ideas on that?
I'm not quite sure why you want a higher value for the three-letter word. You say that intuitively you expected a higher similarity. Strictly speaking, one out of three characters differs between the strings, which makes them 66% similar. Can you elaborate on what your expected outcome should be and why?
I don't know what the exact outcome should be. What makes me think of a higher score is that if you compare h0t and hxt, then intuitively h0t is closer to hot than hxt is, since 0 and o look nearly the same. Just imagine this were handwritten: you wouldn't really mark h0t as wrong, but hxt clearly is.
Well yeah, they are aesthetically similar, but I don't know of any way to test for that. That is quite subjective as well, isn't it? For all intents and purposes x, o and 0 are equally dissimilar to one another.
I just thought of the following "quick and dirty" fix: map digits to letters with a fixed mapping (0->o, 5->s, 3->E, 9->g, ...). Since I'm searching for real words, a zero or five or any other digit should never be part of a keyword.
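
The digit-mapping idea from the last comment could be sketched roughly like this. The table contains only the four pairs mentioned above. Lowercasing both words first is an extra assumption, and `normalize` and `similar_normalized` are hypothetical names.

```python
from difflib import SequenceMatcher

# Look-alike table taken from the comment above; extend it to match the
# confusions your OCR engine actually makes.
LOOKALIKES = str.maketrans({'0': 'o', '5': 's', '3': 'e', '9': 'g'})


def normalize(word):
    # Assumption: keywords are case-insensitive, so lowercase before mapping.
    return word.lower().translate(LOOKALIKES)


def similar_normalized(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(similar_normalized('h0t', 'hot'))  # 1.0
print(similar_normalized('hxt', 'hot'))  # still about 0.67
```

This only helps when the keywords themselves never contain digits, as stated in the comment.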

I wrote the following code. Try it. I defined str3 for the cases where the two strings being compared (str1 and str2) differ in length. The code runs in a while loop; enter -1 for k to exit.

k=1
str3=''
while not k==-1:
    str1=input()
    str2=input()
    k=int(input())  # enter -1 to stop after this comparison
    cnt=0  # reset the match counter for every pair

    if len(str1)>len(str2):
        # truncate the longer string to the length of the shorter one
        str3=str1[0:len(str2)]
        for j in range(0,len(str3)):
            if str3[j]==str2[j]:
                cnt+=1
        print((cnt/len(str1)*100))

    elif len(str1)<len(str2):
        str3=str2[0:len(str1)]
        # iterate over str3, not str2, to avoid an IndexError
        for j in range(0,len(str3)):
            if str3[j]==str1[j]:
                cnt+=1
        print((cnt/len(str2)*100))

    else:
        for j in range(0,len(str2)):
            if str2[j]==str1[j]:
                cnt+=1
        print((cnt/len(str1)*100))

1 Comment

Thanks for sharing your code. This looks like what I tried in the first place, and it gives results similar to mine. The main problem I see is that you lose a lot of information when you truncate the string with str3=str2[0:len(str1)].
