
I am currently trying to figure out how to use Unicode in a regex in Python.

The regex I want to get to work is the following:

r"([A-ZÜÖÄß]+\s)+"

This should match all occurrences of multiple capitalized words that may or may not contain umlauts. Funnily enough, it does nearly what I want, but it still ignores umlauts.

For example, in FUßBALL AND MORE, only BALL AND MORE is detected.

I already tried simply using the Unicode escapes (Ü becomes \u00DC, etc.), as advised in another thread, but that does not work either. I might try the "regex" library instead of "re", but I would like to know what I am doing wrong right now.

If you are able to enlighten me, please feel free to do so.

  • Well, that makes sense. Yes, I am using Python 2.7.12. Cool, that means I don't misunderstand regexes (I feared I had just produced a really stupid regex ;D). Commented Oct 5, 2017 at 9:10
  • Replacing the chars with their ISO representation worked like a charm: r'(?:[A-Z\xC4\xD6\xDC\xDF]+\s)+'. Do you mind posting your comment as an answer? Then I could accept it and close the question. Thanks a lot, by the way! Commented Oct 5, 2017 at 9:48
  • I'll look over it as soon as I am back at my desk. I can't upvote you any more. Somebody must have downvoted your answer, for reasons, I suppose... Commented Oct 8, 2017 at 11:33
  • Yes. Adding the 'u' seems to work well. I changed the answer status accordingly. Commented Oct 9, 2017 at 6:44
  • So, that means it is another duplicate of a very popular question. Closed as such. Commented Oct 9, 2017 at 6:49

1 Answer


Use Unicode strings. Make sure your source is saved in the declared encoding:

#coding:utf8
import re

for s in re.finditer(ur"[A-ZÜÖÄß]+",u"FUßBALL AND MORE"):
    print s.group()

Output:

FUßBALL
AND
MORE

Without Unicode strings, your byte strings are in the encoding of your source file. If that is UTF-8, every non-ASCII character is stored as multiple bytes, so the character class matches individual bytes rather than whole characters. With Unicode strings you will still have problems in a narrow Python build, but only for code points above U+FFFF (such as emoji), because they are stored as UTF-16 surrogate pairs (two code units). In that case, switch to the latest Python 3.x, where the problem is solved and every Unicode code point has a length of 1.
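To make the byte-versus-character point concrete, here is a rough Python 3 sketch that simulates the situation. It is not the asker's actual setup: it assumes the pattern came from a UTF-8 source file while the searched text was Latin-1 encoded, which would be consistent with the \xC4\xD6\xDC\xDF escapes from the comments working. All variable names are made up for illustration.

```python
import re

text = "FUßBALL AND MORE"
pattern = "[A-ZÜÖÄß]+"

# Python 3 str is Unicode: each umlaut is a single character in the class.
unicode_matches = re.findall(pattern, text)
assert unicode_matches == ["FUßBALL", "AND", "MORE"]

# Simulate a Python 2 byte-string pattern from a UTF-8 source file:
# "ß" becomes the two bytes \xc3\x9f, so the class holds bytes, not letters.
utf8_pattern = pattern.encode("utf-8")

# Assumption: the searched text is Latin-1 encoded, where "ß" is the byte \xdf.
latin1_text = text.encode("latin-1")

# \xdf is not in the UTF-8 byte class, so the word is split at "ß".
byte_matches = re.findall(utf8_pattern, latin1_text)
assert byte_matches == [b"FU", b"BALL", b"AND", b"MORE"]

# Writing the class with Latin-1 escapes makes pattern and text agree again.
latin1_pattern = rb"[A-Z\xC4\xD6\xDC\xDF]+"
escaped_matches = re.findall(latin1_pattern, latin1_text)
assert escaped_matches == [b"FU\xdfBALL", b"AND", b"MORE"]
```

The escaped byte pattern only works as long as the text really is Latin-1; using Unicode strings for both pattern and text avoids having to keep the two encodings in sync.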
