How to remove just the accents, but not umlauts, from strings in Python

Question

I'm using the following code:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
              if unicodedata.category(c) != 'Mn')
strip_accents('ewaláièÜÖ')

which returns

'ewalaieUO'

But I want it to return

'ewalaieÜÖ'

Is there an easier way than replacing each character with str.replace(char_a, char_b)? How can I handle this efficiently?

running.t · Accepted Answer · 2017-06-15 21:56:02Z


So let's start with your test input:

In [1]: test
Out[1]: 'ewaláièÜÖ'

See what happens to it when it is normalized:

In [2]: [x for x in unicodedata.normalize('NFD', test)]
Out[2]: ['e', 'w', 'a', 'l', 'a', '́', 'i', 'e', '̀', 'U', '̈', 'O', '̈']

And here are the unicodedata categories of each normalized element:

In [3]: [unicodedata.category(x) for x in unicodedata.normalize('NFD', test)]
Out[3]: ['Ll', 'Ll', 'Ll', 'Ll', 'Ll', 'Mn', 'Ll', 'Ll', 'Mn', 'Lu', 'Mn', 'Lu', 'Mn']

As you can see, not only the "accents" but also the "umlauts" are in category Mn (Mark, nonspacing), so the category alone cannot tell them apart. What you can use instead of unicodedata.category is unicodedata.name:

In [4]: [unicodedata.name(x) for x in unicodedata.normalize('NFD', test)]
Out[4]: ['LATIN SMALL LETTER E',
 'LATIN SMALL LETTER W',
 'LATIN SMALL LETTER A',
 'LATIN SMALL LETTER L',
 'LATIN SMALL LETTER A',
 'COMBINING ACUTE ACCENT',
 'LATIN SMALL LETTER I',
 'LATIN SMALL LETTER E',
 'COMBINING GRAVE ACCENT',
 'LATIN CAPITAL LETTER U',
 'COMBINING DIAERESIS',
 'LATIN CAPITAL LETTER O',
 'COMBINING DIAERESIS']

Here the accents are named COMBINING ACUTE ACCENT and COMBINING GRAVE ACCENT, while the "umlauts" are named COMBINING DIAERESIS. So here is my suggestion for fixing your code:

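def strip_accents(s):
    # Drop only combining marks (category Mn) whose name ends with 'ACCENT';
    # the category check keeps spacing characters such as '^' (CIRCUMFLEX ACCENT).
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not (unicodedata.category(c) == 'Mn'
                           and unicodedata.name(c, '').endswith('ACCENT')))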

strip_accents(test)
'ewalaieÜÖ'
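One thing to be aware of: the function above returns its result in decomposed (NFD) form, so each kept umlaut is a base letter followed by U+0308 rather than the single code point 'Ü'. It prints the same, but it will not compare equal to a precomposed string. A minimal sketch, assuming you want the composed form back, is to normalize to NFC at the end. The name strip_accents_nfc is only illustrative and not part of the original answer.

```python
import unicodedata

def strip_accents_nfc(s):
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize('NFD', s)
    # Drop combining marks (category Mn) whose Unicode name ends with 'ACCENT'.
    kept = ''.join(
        c for c in decomposed
        if not (unicodedata.category(c) == 'Mn'
                and unicodedata.name(c, '').endswith('ACCENT'))
    )
    # Recompose: 'U' + U+0308 becomes the single code point U+00DC again.
    return unicodedata.normalize('NFC', kept)

result = strip_accents_nfc('ewaláièÜÖ')
print(result == 'ewalaie\u00dc\u00d6')  # comparison against precomposed umlauts
print(len(result))
```

Whether NFC is the right final form depends on what the rest of your program expects, so treat the last step as optional.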

Also, as the unicodedata documentation explains, this module is just an interface to the Unicode Character Database (UCD), so please take a look at the list of names in that database to make sure this covers all the cases you need.
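As a rough way of doing that check, the sketch below lists the names of the marks in the Combining Diacritical Marks block (U+0300 to U+036F) and splits them by whether the 'ACCENT' suffix test catches them. It shows that marks such as COMBINING TILDE and COMBINING CEDILLA are not caught. If you would rather strip every nonspacing mark except the diaeresis, the second function is one possible alternative. Both helper names (combining_mark_names, strip_marks_except) are made up for this illustration.

```python
import unicodedata

def combining_mark_names(first=0x0300, last=0x036F):
    # Map each nonspacing mark in the given range to its Unicode name.
    return {
        chr(cp): unicodedata.name(chr(cp), '')
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) == 'Mn'
    }

names = combining_mark_names()
caught = sorted(n for n in names.values() if n.endswith('ACCENT'))
missed = sorted(n for n in names.values() if not n.endswith('ACCENT'))
print(caught)  # marks the suffix test removes
print(missed)  # marks the suffix test leaves in place

def strip_marks_except(s, keep=('\u0308',)):
    # Alternative: remove every nonspacing mark except those listed in `keep`
    # (by default only U+0308 COMBINING DIAERESIS), then recompose.
    decomposed = unicodedata.normalize('NFD', s)
    kept = ''.join(
        c for c in decomposed
        if unicodedata.category(c) != 'Mn' or c in keep
    )
    return unicodedata.normalize('NFC', kept)

print(strip_marks_except('ñçÜ') == 'nc\u00dc')
```

Note that this alternative also turns 'ñ' into 'n' and 'ç' into 'c', which may or may not be what you want.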

edited Jun 15, 2017 at 21:56

answered Jun 15, 2017 at 21:48


1 Comment

user27772 Over a year ago

For those interested in a Java 11+ solution, I came up with Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("((?<![aouAOU])\\p{M})|((?<=[aouAOU])[\\p{M}&&[^\u0308]])", ""). First, Unicode normalization into a decomposed form separates secondary characters (e.g. accents) from primary characters (like "a"). Then we remove all secondary characters that are not preceded by [aouAOU], as well as all secondary characters other than \u0308 (Combining Diaeresis) that are preceded by [aouAOU]. The result can be composed again using Normalizer.normalize(s, Normalizer.Form.NFKC);.
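For comparison, here is a rough Python port of the idea in this comment. Python's built-in re module does not support \p{M}, so this sketch walks the decomposed string instead of using a regular expression. The function name and the two constants are hypothetical, and the behaviour is my reading of the Java pattern: a diaeresis survives only when the character immediately before it is one of a, o, u, A, O, U.

```python
import unicodedata

UMLAUT_BASES = set('aouAOU')   # letters that form German umlauts
DIAERESIS = '\u0308'           # COMBINING DIAERESIS

def strip_marks_keep_aou_umlauts(s):
    # NFKD, as in the Java comment; note it also folds compatibility
    # characters (ligatures, for instance), not just accents.
    decomposed = unicodedata.normalize('NFKD', s)
    out = []
    prev = ''  # character immediately before the current one
    for c in decomposed:
        is_mark = unicodedata.category(c).startswith('M')  # Mn, Mc or Me
        if not is_mark or (c == DIAERESIS and prev in UMLAUT_BASES):
            out.append(c)
        prev = c
    # Recompose, mirroring the final NFKC step of the Java version.
    return unicodedata.normalize('NFKC', ''.join(out))

print(strip_marks_keep_aou_umlauts('ewaláièÜÖ') == 'ewalaie\u00dc\u00d6')
print(strip_marks_keep_aou_umlauts('naïve'))
```

Unlike the name-based answer, this version also removes the diaeresis from letters such as 'ï', since only a, o and u are treated as umlaut bases.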
