The standard approach for removing special characters seems to be discussed in this question. But you could also consider another approach, often called fuzzy matching (or fuzzy search):
[...] technique of finding strings that match a pattern approximately (rather than exactly)
In Python you can use [TheFuzz](https://github.com/seatgeek/thefuzz) to do that. Here is a try based on your examples:

```python
from thefuzz import fuzz

pairs = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]
for a, b in pairs:
    print(f"{a} vs {b}: {fuzz.ratio(a, b)}")

# mañana vs manana: 83
# shouldn't vs shouldnt: 94
# even-distribuited vs even distribuited: 94
```
So you could define a threshold rule on the ratio to decide whether two strings count as a match.
You could even combine Unicode normalization and fuzzy matching for better results:

```python
import unicodedata

from thefuzz import fuzz

pairs = [("mañana", "manana"), ("shouldn't", "shouldnt"), ("even-distribuited", "even distribuited")]

def compare(pairs, normalize=True):
    for a, b in pairs:
        if normalize:
            # Decompose accented characters (NFKD), then drop the non-ASCII marks.
            a, b = (
                unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
                for s in (a, b)
            )
        print(f"{a} vs {b}: {fuzz.ratio(a, b)}")

compare(pairs)

# manana vs manana: 100
# shouldn't vs shouldnt: 94
# even-distribuited vs even distribuited: 94
```
Alternatively, you can simply `re.sub` out the characters you don't want.
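For example, a one-liner that strips everything except word characters and whitespace; the character class `[^\w\s]` is an assumption here, so adjust it to the exact set of characters you want removed:

```python
import re

text = "even-distribuited, shouldn't, mañana!"
# Remove anything that is neither a word character nor whitespace
# (hyphens, apostrophes, punctuation); \w matches ñ in Python 3.
cleaned = re.sub(r"[^\w\s]", "", text)
print(cleaned)  # evendistribuited shouldnt mañana
```

Note that this keeps accented letters; combine it with the `unicodedata.normalize` step above if you also want to fold those to ASCII.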