
I was experimenting with fuzzywuzzy and found that in quite a few cases it was generating wrong results. While debugging, I came across a scenario with get_matching_blocks() that was difficult to explain.

My understanding of get_matching_blocks() is that it should return a list of triples (i, j, n), where the substring of length n at index i in the first string matches exactly the substring of length n at index j in the second string.

>>> hay = """"Find longest matching block in a[alo:ahi] and b[blo:bhi]. If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b."""
>>> needle = "meeting those conditions"
>>> needle in hay
True
>>> sm = difflib.SequenceMatcher(None, needle, hay)
>>> sm.get_matching_blocks()
[Match(a=5, b=8, size=2), Match(a=24, b=550, size=0)]

So why does the above code fail to find the matching block?

1 Answer


I may not be seeing well, but you aren't matching hay and needle. You have

sm = difflib.SequenceMatcher(None,needle, sms)

shouldn't it be

sm = difflib.SequenceMatcher(None, needle, hay)

? Also, for the record, the last element in the list returned by get_matching_blocks() is a dummy of the form (len(a), len(b), 0).

Maybe it's just a mistake in the pasting, but if so, please update your question with the actual code (I'm thinking of the SequenceMatcher() call).


SequenceMatcher has a broken "junk heuristic": if the second string is at least 200 characters long, then every character whose count (minus one) exceeds 1% of the total length is treated as junk. A comment on the official bug ticket explains it:

The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.
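
To make the heuristic concrete, here is a small sketch that recomputes it outside of difflib. The threshold below (count > len(b) // 100 + 1, applied only when len(b) >= 200) is my reading of the behaviour described above, not a documented API, so treat it as an approximation:

from collections import Counter

def popular_junk(b):
    # Approximate difflib's autojunk heuristic: for sequences of at
    # least 200 items, any item occurring in more than ~1% of the
    # positions is treated as junk and excluded from matching.
    n = len(b)
    if n < 200:
        return set()
    ntest = n // 100 + 1
    return {item for item, count in Counter(b).items() if count > ntest}

Running this on the hay string from the question flags the space and most of the frequent letters, which is why the characters of the needle never take part in the match at all.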

I will also quote the code examples provided by the author of that comment (SM stands for difflib.SequenceMatcher):

Here len(a) == 200, len(b) == 199:

>>> print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
0.9975  # correct

Here len(a) == 199, len(b) == 200 (we switch a and b):

>>> print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
0.0  # wrong

Obviously, both cases should give the same output.

The bug was addressed by adding the optional parameter mentioned above, autojunk, which should be manually set to False for correct behaviour here.
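
Here is a minimal, self-contained sketch of the same failure and its fix; the needle and hay strings below are invented for illustration (any haystack of 200 or more characters in which every character of the needle crosses the 1% threshold behaves the same way):

import difflib

needle = "needle"
# 206-character haystack in which each character of the needle occurs
# far more often than 1% of the time, so autojunk discards them all.
hay = "n" * 50 + "e" * 50 + "d" * 50 + "l" * 50 + needle

# Default (autojunk=True): every character of the needle is junk in
# hay, so only the trailing dummy block comes back.
print(difflib.SequenceMatcher(None, needle, hay).get_matching_blocks())

# autojunk=False: the full six-character match is found.
print(difflib.SequenceMatcher(None, needle, hay,
                              autojunk=False).get_matching_blocks())

The same fix applies to the question: with autojunk=False, get_matching_blocks() returns the full 24-character match for "meeting those conditions".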


Comments

Thanks for pointing it out. I have updated the question. And I am aware of the fact that the last element in the list returned is a dummy.
Okay, my bad, running it in the interpreter gives the same result. However, switching the parameters produces correct output; maybe this will point someone towards an answer.
Okay, another update: this has something to do with the autojunk parameter. Adding False as the fourth argument of SequenceMatcher makes the output correct.
I guess this parameter is generally thought of as "controversial", as it doesn't always work as intended. It may create a situation where some letter is classified as junk (when it shouldn't be) and thus break the search. Unfortunately, I can't find decent documentation on this specific parameter or on how the junk heuristic works, and I don't have time to rummage through the difflib code at the moment...
@Abhijit I have updated my answer based on what I found, hope it satisfies you. tl;dr: it's an implementation "feature".
