I'm using the following code:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
              if unicodedata.category(c) != 'Mn')
strip_accents('ewaláièÜÖ')

which returns

'ewalaieUO'

But I want it to return

'ewalaieÜÖ'

Is there an easier way than replacing the characters one by one with str.replace(char_a, char_b)? How can I handle this efficiently?

1 Answer

So let's start with your test input:

In [1]: test
Out[1]: 'ewaláièÜÖ'

Here is what happens to it during normalization:

In [2]: [x for x in unicodedata.normalize('NFD', test)]
Out[2]: ['e', 'w', 'a', 'l', 'a', '́', 'i', 'e', '̀', 'U', '̈', 'O', '̈']

And here are the unicodedata categories of each normalized element:

In [3]: [unicodedata.category(x) for x in unicodedata.normalize('NFD', test)]
Out[3]: ['Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Mn', 'Lu', 'Mn', 'Lu', 'Mn']

As you can see, not only the "accents" but also the "umlauts" are in category Mn, so filtering on the category cannot tell them apart. Instead of unicodedata.category you can use unicodedata.name:

In [4]: [unicodedata.name(x) for x in unicodedata.normalize('NFD', test)]
Out[4]: ['LATIN SMALL LETTER E',
 'LATIN SMALL LETTER W',
 'LATIN SMALL LETTER A',
 'LATIN SMALL LETTER L',
 'LATIN SMALL LETTER A',
 'COMBINING ACUTE ACCENT',
 'LATIN SMALL LETTER I',
 'LATIN SMALL LETTER E',
 'COMBINING GRAVE ACCENT',
 'LATIN CAPITAL LETTER U',
 'COMBINING DIAERESIS',
 'LATIN CAPITAL LETTER O',
 'COMBINING DIAERESIS']

Here the accents are named COMBINING ACUTE ACCENT and COMBINING GRAVE ACCENT, while the "umlauts" are named COMBINING DIAERESIS. So here is my suggestion for fixing your code:

def strip_accents(s):
    # name(c, '') returns '' instead of raising ValueError for unnamed characters such as '\n'
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.name(c, '').endswith('ACCENT'))

strip_accents(test)
'ewalaieÜÖ'
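
One detail the output above hides: the returned string is still in decomposed (NFD) form, so each umlaut is stored as a base letter followed by a separate COMBINING DIAERESIS. If the result will be compared with precomposed text, recomposing it with NFC may help. The following is only a sketch; the name strip_accents_nfc is made up for illustration and is not part of the original answer.

```python
import unicodedata

def strip_accents_nfc(s):
    """Hypothetical variant: drop '... ACCENT' marks, then recompose to NFC."""
    decomposed = unicodedata.normalize('NFD', s)
    kept = ''.join(c for c in decomposed
                   if not unicodedata.name(c, '').endswith('ACCENT'))
    # NFC merges 'U' + COMBINING DIAERESIS back into the single code point U+00DC
    return unicodedata.normalize('NFC', kept)

plain = strip_accents_nfc('ewaláièÜÖ')
# Compare against the precomposed spelling, written with escapes to be explicit
matches_precomposed = plain == 'ewalaie\u00dc\u00d6'
```

Without the final NFC step the two strings look identical when printed but differ in length and do not compare equal.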

Also, as the unicodedata documentation says, this module is just a wrapper around the Unicode Character Database, so please take a look at the list of character names in that database to make sure this covers all the cases you need.
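
To illustrate why checking the names matters: some common marks, such as COMBINING TILDE and COMBINING CEDILLA, do not end in ACCENT, so the name-based filter would leave ñ and ç untouched. One possible alternative, sketched below, is to go back to the category test and exempt an explicit set of marks. KEEP_MARKS and strip_marks_except are hypothetical names, and the assumption is that the diaeresis is the only mark worth keeping.

```python
import unicodedata

# Assumption: COMBINING DIAERESIS (U+0308) is the only mark that should survive.
KEEP_MARKS = {'\u0308'}

def strip_marks_except(s, keep=KEEP_MARKS):
    """Hypothetical helper: drop every nonspacing mark (Mn) not listed in keep."""
    decomposed = unicodedata.normalize('NFD', s)
    kept = ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn' or c in keep)
    return unicodedata.normalize('NFC', kept)

# Names of a few common marks, to see which ones the 'ACCENT' suffix misses:
# acute accent, tilde, diaeresis, cedilla.
mark_names = [unicodedata.name(c) for c in '\u0301\u0303\u0308\u0327']

# 'señor façade Ärger', written with escapes
cleaned = strip_marks_except('se\u00f1or fa\u00e7ade \u00c4rger')
```

Note that this keeps the diaeresis on every letter, so ë and ï would survive as well. Whether that is acceptable depends on the input.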

1 Comment

For those interested in a Java 11+ solution, I came up with Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("((?<![aouAOU])\\p{M})|((?<=[aouAOU])[\\p{M}&&[^\u0308]])", ""). First, Unicode normalization into a decomposed form separates secondary characters (e.g. accents) from primary characters (like "a"). Then we remove every secondary character that is not preceded by [aouAOU], as well as every secondary character other than \u0308 (COMBINING DIAERESIS) that is preceded by [aouAOU]. The result can be composed again using Normalizer.normalize(s, Normalizer.Form.NFKC).
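
The same rule can be sketched in Python, the language of the question. Python's built-in re module does not support \p{M}, so this version walks the decomposed string instead of using a regular expression. It is a rough port under that assumption, not the commenter's code, and strip_marks_keep_umlauts, UMLAUT_BASES and DIAERESIS are names invented here.

```python
import unicodedata

UMLAUT_BASES = set('aouAOU')   # letters allowed to keep their diaeresis
DIAERESIS = '\u0308'           # COMBINING DIAERESIS

def strip_marks_keep_umlauts(s):
    """Hypothetical port: keep a mark only if it is a diaeresis directly after a/o/u."""
    # NFKD mirrors the Java comment; it also folds compatibility characters
    decomposed = unicodedata.normalize('NFKD', s)
    out = []
    prev = ''
    for c in decomposed:
        # Categories Mn, Mc and Me together correspond to \p{M}
        is_mark = unicodedata.category(c).startswith('M')
        if not is_mark or (c == DIAERESIS and prev in UMLAUT_BASES):
            out.append(c)
        prev = c  # like the lookbehind, compare against the unfiltered input
    return unicodedata.normalize('NFKC', ''.join(out))
```

Unlike the name-based filter, this removes the diaeresis from letters such as ï and ë while keeping it on ä, ö and ü. Because it uses the compatibility forms, it also rewrites characters like ligatures, so NFD and NFC may be the better choice if that is unwanted.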
