making difflib's SequenceMatcher ignore "junk" characters

Question

I have a lot of strings that I want to match for similarity (each string is about 30 characters long on average). I found difflib's SequenceMatcher great for this task: it is simple to use and the results are good. But if I compare hellboy and hell-boy like this:

>>> from difflib import SequenceMatcher
>>> sm = SequenceMatcher(lambda x: x == '-', 'hellboy', 'hell-boy')
>>> sm.ratio()
0.93333333333333335

I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified by the function above are only skipped when finding the longest contiguous matching subsequence; they still count towards the ratio. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?

It's kind of hackish, but is there any reason you couldn't just remove the junk characters before doing the comparison? It's essentially the same thing as ignoring them.
– Gareth Latty, Commented Apr 2, 2012 at 20:58
Yes, that's good, but I wanted to figure out whether I could just do some difflib magic and get away with it; otherwise I would have to pass each string through another function first to remove all the junk characters.
– lovesh, Commented Apr 2, 2012 at 21:09

Community · Accepted Answer · 2017-05-23 12:17:17Z

4

If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().

E.g.:

to_compare = to_compare.translate(None, "-")  # Python 2 byte strings only; the second argument is a string of characters to delete

As shown here, this is significantly (3x) faster (and, I feel, nicer to read) than a regex.

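Note that under Python 3.x, or if you are using Unicode strings under Python 2.x, this will not work, as the deletechars parameter is not accepted. In this case, you simply need to make a mapping from each unwanted character to None (the example below uses Python 3's str.maketrans(); on Python 2 Unicode strings, pass a dict keyed by ordinal, such as {ord(u"-"): None}). E.g.: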

translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)
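
Tying this back to the question, here is a minimal sketch, assuming Python 3, that strips the junk first and then compares. The names `JUNK_MAP` and `similarity` are made up for illustration and are not part of difflib:

```python
from difflib import SequenceMatcher

# Hypothetical names for illustration; only "-" is treated as junk here.
JUNK_MAP = str.maketrans({"-": None})

def similarity(a, b):
    """Return the SequenceMatcher ratio of a and b after deleting junk characters."""
    return SequenceMatcher(None, a.translate(JUNK_MAP), b.translate(JUNK_MAP)).ratio()

print(similarity("hellboy", "hell-boy"))  # 1.0
```

Because both inputs are cleaned before SequenceMatcher sees them, the hyphen no longer counts towards the total length used in the ratio.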

If you have a lot of characters you want to remove, a small function saves some typing: just make a set of the characters and pass it through:

def to_translation_map(iterable):
    # translate() looks characters up by code point, so the keys must be ordinals.
    return {ord(key): None for key in iterable}
    #return dict((ord(key), None) for key in iterable) #For old versions of Python without dict comps.
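
One caveat, sketched here under the assumption of Python 3: `str.translate()` looks characters up by integer code point, so a hand-built mapping needs `ord()` keys (or has to go through `str.maketrans()`), otherwise nothing is removed. The helper names `make_deletion_map` and `strip_junk` below are hypothetical:

```python
def make_deletion_map(junk_chars):
    # Keys are code points, because str.translate() indexes the table by integer.
    return {ord(ch): None for ch in junk_chars}

def strip_junk(text, junk_chars):
    # Delete every character of junk_chars from text.
    return text.translate(make_deletion_map(junk_chars))

junk = {"-", "_", "*"}
print(strip_junk("hell-boy_*", junk))  # hellboy
```

If the same junk set is reused for many strings, building the map once and reusing it avoids rebuilding the dict on every call.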

edited May 23, 2017 at 12:17

CommunityBot

answered Apr 3, 2012 at 10:16

Gareth Latty

BenMorel · Accepted Answer · 2013-12-07 16:42:55Z

1

If you were to write a function to remove all the junk characters beforehand, you could use re:

import re
string = re.sub(r'-|_|\*', '', string)

For the regular expression '-|_|\*', just put a | between the junk characters, and if one of them is a special re character (like * and +), put a \ before it.
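
Escaping by hand gets error-prone once the junk set grows. One possible alternative, sketched here on the assumption that every junk item is a single character, is to build a character class with `re.escape()` and compile it once. `JUNK_CHARS`, `JUNK_RE` and `remove_junk` are illustrative names only:

```python
import re

JUNK_CHARS = "-_*"  # assumed junk set, for illustration

# re.escape() backslash-escapes the special characters; the [...] class
# then matches any single one of them.
JUNK_RE = re.compile("[" + re.escape(JUNK_CHARS) + "]")

def remove_junk(text):
    # Replace every junk character with the empty string.
    return JUNK_RE.sub("", text)

print(remove_junk("hell-boy_*"))  # hellboy
```

Compiling the pattern up front keeps the per-string call to a single `sub()`, which matters when many strings are cleaned in a loop.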

edited Dec 7, 2013 at 16:42

BenMorel

answered Apr 3, 2012 at 0:39

apple16

1 Comment

Sam Rockett Over a year ago

Is -|_|\* better than using [-_*], or are they equal efficiency-wise?
