
I was experimenting with fuzzywuzzy and found that in quite a few cases it was generating wrong results. While debugging, I came across a scenario with get_matching_blocks() that was difficult to explain.

My understanding of get_matching_blocks() is that it should return a list of triples (i, j, n), where the substring of length n at index i in the first string matches exactly the substring of length n at index j in the second string.

>>> hay = """"Find longest matching block in a[alo:ahi] and b[blo:bhi]. If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b."""
>>> needle = "meeting those conditions"
>>> needle in hay
True
>>> sm = difflib.SequenceMatcher(None, needle, hay)
>>> sm.get_matching_blocks()
[Match(a=5, b=8, size=2), Match(a=24, b=550, size=0)]

So why does the above code fail to find the matching block?

1 Answer


I may not be seeing well, but you aren't matching hay and needle. You have

sm = difflib.SequenceMatcher(None,needle, sms)

shouldn't it be

sm = difflib.SequenceMatcher(None, needle, hay)

? Also, for the record, the last element in the list returned by get_matching_blocks() is a dummy of the form (len(a), len(b), 0).

Maybe it's just a mistake in the pasting, but if so, please update your question with the actual code (I'm thinking of the SequenceMatcher() call).


SequenceMatcher has a broken "junk heuristic": if the second string is at least 200 characters long, then every character whose count (minus one) exceeds 1% of the total length is treated as junk. A comment on the official bug ticket explains it:

The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.
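
To make the heuristic concrete, here is a small sketch that recomputes it outside of difflib. The threshold below (count > len(b) // 100 + 1, applied only when len(b) >= 200) is my reading of the behaviour described above, not a documented API, so treat it as an approximation:

from collections import Counter

def popular_junk(b):
    # Approximate difflib's autojunk heuristic: for sequences of at
    # least 200 items, any item occurring in more than ~1% of the
    # positions is treated as junk and excluded from matching.
    n = len(b)
    if n < 200:
        return set()
    ntest = n // 100 + 1
    return {item for item, count in Counter(b).items() if count > ntest}

Running this on the hay string from the question flags the space and most of the frequent letters, which is why the characters of the needle never take part in the match at all.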

I will also quote the code examples provided by the author of that comment (SM stands for difflib.SequenceMatcher):

Here len(a) == 200, len(b) == 199:

>>> print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
0.9975  # correct

Here len(a) == 199, len(b) == 200 (we switch a and b):

>>> print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
0.0  # wrong

Obviously, both cases should give the same output.

The bug was addressed by adding the optional parameter mentioned above, autojunk, which should be manually set to False for correct behaviour here.
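
Here is a minimal, self-contained sketch of the same failure and its fix; the needle and hay strings below are invented for illustration (any haystack of 200 or more characters in which every character of the needle crosses the 1% threshold behaves the same way):

import difflib

needle = "needle"
# 206-character haystack in which each character of the needle occurs
# far more often than 1% of the time, so autojunk discards them all.
hay = "n" * 50 + "e" * 50 + "d" * 50 + "l" * 50 + needle

# Default (autojunk=True): every character of the needle is junk in
# hay, so only the trailing dummy block comes back.
print(difflib.SequenceMatcher(None, needle, hay).get_matching_blocks())

# autojunk=False: the full six-character match is found.
print(difflib.SequenceMatcher(None, needle, hay,
                              autojunk=False).get_matching_blocks())

The same fix applies to the question: with autojunk=False, get_matching_blocks() returns the full 24-character match for "meeting those conditions".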


Comments

Thanks for pointing it out. I have updated the question. And I am aware of the fact that the last element in the list returned is a dummy.
Okay, my bad, running it in the interpreter gives the same result. However, switching the parameters produces correct output; maybe this will point someone towards an answer.
Okay, another update: this has something to do with the autojunk parameter. Adding False as the fourth argument of SequenceMatcher makes the output correct.
I guess this parameter is generally thought of as "controversial", as it doesn't always work as intended. It may create a situation where some letter is classified as junk (when it shouldn't be) and thus break the search. Unfortunately, I can't find decent documentation on this specific parameter or on how the junk heuristic works, and I don't have time to rummage through the difflib code at the moment...
@Abhijit I have updated my answer based on what I found, hope it satisfies you. tl;dr: it's an implementation "feature".
