7

I have a lot of strings that I want to compare for similarity (each string is 30 characters long on average). I found difflib's SequenceMatcher great for this task: it is simple and the results are good. But if I compare hellboy and hell-boy like this:

>>> sm=SequenceMatcher(lambda x:x=='-','hellboy','hell-boy')
>>> sm.ratio()
0: 0.93333333333333335

I want such words to give a 100 percent match, i.e. a ratio of 1.0. I understand that the junk characters specified in the function above are not excluded from the comparison; they are only ignored when finding the longest contiguous matching subsequence. Is there some way I can make SequenceMatcher ignore some "junk" characters for comparison purposes?

2
  • 3
    It's kind of hackish, but any reason you couldn't just remove the junk characters before doing the comparison? It's essentially the same thing as ignoring them. Commented Apr 2, 2012 at 20:58
  • Yes, that's good, but I wanted to figure out whether I could just do some difflib magic and get away with it; otherwise I would have to pass each string through another function to remove all the junk chars first. Commented Apr 2, 2012 at 21:09

2 Answers

4

If you wish to do as I suggested in the comments (removing the junk characters), the fastest method is to use str.translate().

E.g:

# Python 2 byte strings only: the second argument is a string of characters to delete.
to_compare = to_compare.translate(None, "-")

As shown here, this is significantly (3x) faster (and, I feel, nicer to read) than a regex.

Note that under Python 3.x, or if you are using Unicode strings under Python 2.x, this will not work, as the deletechars parameter is not accepted. In this case, you simply need to make a mapping to None. E.g:

translation_map = str.maketrans({"-": None})
to_compare = to_compare.translate(translation_map)

You could also write a small function to save some typing if you have a lot of characters you want to remove; just make a set and pass it through:

def to_translation_map(iterable):
    # translate() looks characters up by code point, so the keys must be ord() values.
    return {ord(key): None for key in iterable}
    #return dict((ord(key), None) for key in iterable) #For old versions of Python without dict comps.
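
To tie this back to the question, here is a minimal sketch, assuming Python 3, of stripping the junk characters before handing the strings to SequenceMatcher. The names JUNK, strip_junk and similarity are hypothetical helpers made up for illustration; they are not part of difflib.

```python
from difflib import SequenceMatcher

# Hypothetical set of characters to ignore; adjust to your data.
JUNK = "-"


def strip_junk(text, junk=JUNK):
    """Return text with every character in junk removed (Python 3 str)."""
    table = str.maketrans({ch: None for ch in junk})
    return text.translate(table)


def similarity(a, b, junk=JUNK):
    """Compare two strings with SequenceMatcher after removing junk characters."""
    return SequenceMatcher(None, strip_junk(a, junk), strip_junk(b, junk)).ratio()


print(similarity("hellboy", "hell-boy"))  # 1.0
```

Once the junk is removed up front, both inputs are the same string, so the ratio is exactly 1.0 and no isjunk function is needed.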

1

If you were to make a function to remove all the junk characters beforehand, you could use re:

string = re.sub(r'-|_|\*', '', string)

For the regular expression '-|_|\*', just put a | between all the junk characters, and if one is a special re character, put a \ before it (like * and +).
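
The escaping step can be automated with re.escape. This is only a sketch: remove_junk and its junk_chars parameter are hypothetical names chosen for illustration, and the pattern is built as the same kind of alternation described above.

```python
import re


def remove_junk(text, junk_chars="-_*"):
    """Remove every character in junk_chars from text using a regex alternation."""
    # re.escape() adds the backslash wherever a character is special to re.
    pattern = "|".join(re.escape(ch) for ch in junk_chars)
    return re.sub(pattern, "", text)


print(remove_junk("hell-bo_y*"))  # hellboy
```

Building the pattern from a plain string of characters means you do not have to remember which ones need a backslash.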

1 Comment

Is -|_|\* better than using [-_*], or are they equal efficiency-wise?
