I use a tokenizer to split French sentences into words and had problems with words containing the French character â.

I tried to isolate the problem and it eventually boiled down to this simple fact:

>>> re.match(r"’", u'â', re.U)
>>> re.match(r"[’]", u'â', re.U)
<_sre.SRE_Match object at 0x21d41d0>

â is matched by a pattern containing ’ when the ’ is placed inside a character class (square brackets).

Is there something wrong on my part regarding UTF-8 handling or is it a bug?

My python version is:

Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2

EDIT:

Hmm, embarrassingly enough, it seems that replacing the r prefix on the pattern with a u fixes the issue.

I wonder why the official documentation uses r so extensively, then :(

  • r is correct and important. You should add u (see the answer) instead of replacing r. Commented Apr 17, 2013 at 18:50
  • @m.buettner: yup I edited before seeing the answer. I went on and checked what r and u do, and indeed both are important. Thanks :) Commented Apr 17, 2013 at 19:41

1 Answer

Your pattern should be a unicode string too:

 >>> re.match(ur"’", u'â', re.U)
 >>> re.match(ur"[’]", u'â', re.U)

Otherwise, sre apparently encodes â to latin-1 and finds the resulting byte among the three bytes that make up the UTF-8 encoding of ’.

"[’]" is equivalent to "[\xe2\x80\x99]", and u'â'.encode('latin-1') is \xe2.
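
To see the byte overlap directly, here is a minimal sketch. It assumes the explanation above holds, i.e. that a byte-string character class is treated as a set of individual bytes. The variable names are only illustrative, and escapes are used instead of literal characters so the result does not depend on the source file encoding. It compares byte strings explicitly, which is roughly what a non-unicode pattern amounts to in Python 2.

```python
import re

# UTF-8 encoding of the right single quotation mark (U+2019): three bytes.
apostrophe_utf8 = u"\u2019".encode("utf-8")   # '\xe2\x80\x99'
# latin-1 encoding of a-circumflex (U+00E2): one byte, equal to the first byte above.
a_circ_latin1 = u"\u00e2".encode("latin-1")   # '\xe2'

# In a byte pattern, the character class is a set of three separate bytes.
byte_class = re.compile(b"[" + apostrophe_utf8 + b"]")
# Outside a class, the same three bytes must appear in sequence.
byte_literal = re.compile(apostrophe_utf8)

print(byte_class.match(a_circ_latin1) is not None)    # True: '\xe2' is in the set
print(byte_literal.match(a_circ_latin1) is not None)  # False: one byte cannot match three

# With unicode on both sides, the class holds a single code point.
unicode_class = re.compile(u"[\u2019]")
print(unicode_class.match(u"\u00e2") is not None)  # False
print(unicode_class.match(u"\u2019") is not None)  # True
```

The unicode version behaves as expected because the class then contains one character, U+2019, rather than three unrelated bytes.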

2 Comments

thanks for the hint, I noticed it just before you answered :)
re.U does not magically turn on unicode, it just changes the meaning of \w.
