Python extremely puzzling regex unicode behaviour

Question

I use a tokenizer to split French sentences into words and ran into problems with words containing the French character â.

I tried to isolate the problem and it eventually boiled down to this simple fact:

>>> re.match(r"’", u'â', re.U)
>>> re.match(r"[’]", u'â', re.U)
<_sre.SRE_Match object at 0x21d41d0>

â is matched by a pattern containing ’ when the ’ is placed inside a character class (a [...] set), but not when it appears on its own.

Is there something wrong on my part regarding UTF-8 handling or is it a bug?

My python version is:

Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
[GCC 4.7.2] on linux2

EDIT:

Hmm, embarrassingly enough, it seems that replacing the r prefix on the pattern with a u fixes the issue.

I wonder why the official documentation uses r so extensively then :((

r is correct and important. You should add u (see the answer) instead of replacing r.
– Martin Ender, Commented Apr 17, 2013 at 18:50
@m.buettner: yup, I edited before seeing the answer. I went on and checked what r and u do, and indeed both are important. Thanks :)
– m09, Commented Apr 17, 2013 at 19:41

Pavel Anossov · Accepted Answer · 2013-04-17 18:49:54Z

7

Your pattern should be a unicode string too:

 >>> re.match(ur"’", u'â', re.U)
 >>> re.match(ur"[’]", u'â', re.U)

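Otherwise sre apparently treats the pattern as a sequence of bytes: â encoded as Latin-1 is the single byte \xe2, and that byte is one of the three bytes of the UTF-8 encoding of ’. Inside a character class each of those bytes is a separate alternative, so â matches; outside a class all three bytes must match in order, so it does not.

When the source or terminal is UTF-8, the byte string "[’]" is equivalent to "[\xe2\x80\x99]", and u'â'.encode('latin-1') is '\xe2'.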
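To check the byte-level coincidence without depending on the terminal encoding, here is a minimal sketch that spells both characters out with escapes. The variable names are illustrative only, and the bytes patterns stand in for what a Python 2 str pattern would contain, assuming the pattern was typed in a UTF-8 environment.

```python
# -*- coding: utf-8 -*-
import re

# Byte-level view of the two characters involved.
quote_utf8 = u"\u2019".encode("utf-8")        # the three UTF-8 bytes of ’
a_circ_latin1 = u"\u00e2".encode("latin-1")   # the single Latin-1 byte of â

# As a plain sequence, all three bytes must match in order, so â alone fails.
sequence_match = re.match(quote_utf8, a_circ_latin1)

# Inside a character class each byte is a separate alternative, so \xe2 matches.
class_match = re.match(b"[" + quote_utf8 + b"]", a_circ_latin1)

# With text on both sides, ’ is a single code point and never equals â.
text_match = re.match(u"[\u2019]", u"\u00e2")

print(quote_utf8 == b"\xe2\x80\x99", a_circ_latin1 == b"\xe2")
print(sequence_match is None, class_match is not None, text_match is None)
```

Only the character-class version on bytes produces a match, which mirrors the two results in the question; once pattern and subject are both text, the class no longer matches.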


2 Comments

thanks for the hint, I noticed it just before you answered :)

re.U does not magically turn on Unicode; it only makes \w, \W, \b, \B, \d, \D, \s and \S follow the Unicode character database.
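The point about re.U can be sketched as follows. This assumes Python 3 rather than the asker's 2.7: there, str patterns are Unicode-aware by default and re.ASCII is the flag that switches that off, so the contrast is between re.UNICODE and re.ASCII. The variable names are illustrative.

```python
import re

a_circ = "\u00e2"  # â

# Assumption: Python 3, where re.UNICODE is already the default for str
# patterns and re.ASCII restricts \w to ASCII letters, digits and underscore.
word_unicode = re.match(r"\w", a_circ, re.UNICODE)
word_ascii = re.match(r"\w", a_circ, re.ASCII)

# The flag does not change how a literal character in the pattern is compared:
# ’ (U+2019) fails to match â under either flag.
literal_unicode = re.match("[\u2019]", a_circ, re.UNICODE)
literal_ascii = re.match("[\u2019]", a_circ, re.ASCII)

print(word_unicode is not None, word_ascii is None)
print(literal_unicode is None, literal_ascii is None)
```

The flag changes whether â counts as a word character, but it has no bearing on whether pattern and subject are the same kind of string, which was the actual cause of the surprising match.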