Why does Python 2.x throw an exception with string formatting + unicode?

Question

I have the following piece of code. The last line throws an error. Why is that?

class Foo(object):

    def __unicode__(self):
        return u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

    def __str__(self):
        return self.__unicode__().encode('utf-8')

print "this works %s" % (u'asdf')
print "this works %s" % (Foo(),)
print "this works %s %s" % (Foo(), 'asdf')
print

print "this also works {0} {1}".format(Foo(), u'asdf')
print
print "this should break %s %s" % (Foo(), u'asdf')

The error is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 18: ordinal not in range(128)"

Martijn Pieters · Accepted Answer · 2014-03-20 15:48:42Z

When you mix unicode and byte string objects, Python 2 implicitly tries to either encode the unicode values to byte strings or decode the byte strings to unicode.

You are mixing unicode, byte strings and a custom object, and that triggers a sequence of implicit encodes and decodes that cannot succeed.

In this case, your Foo() value is interpolated as a byte string (str(Foo()) is used). The u'asdf' interpolation then triggers a decode of the result built so far (which already contains the UTF-8 encoded Foo() value) so that the unicode string can be interpolated. This decode fails because the ASCII codec cannot decode the \xe6\x9e\x97 UTF-8 byte sequence already interpolated.
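To make that implicit step visible, here is a minimal sketch that performs the same decode by hand. It assumes the partial result at the moment u'asdf' is reached is the literal prefix plus the UTF-8 bytes from str(Foo()); the names `utf8_bytes`, `partial` and `error` are illustrative only and do not come from the question. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: do by hand the ASCII decode that Python 2 attempts implicitly.
text = u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'
utf8_bytes = text.encode('utf-8')   # what Foo.__str__() returns

# Assumed state of the byte string result just before u'asdf' is interpolated.
partial = b'this should break ' + utf8_bytes

error = None
try:
    partial.decode('ascii')         # the implicit step, made explicit
except UnicodeDecodeError as exc:
    error = exc

# The failing offset is the first byte of the Foo() value, right after
# the 18-character prefix, which matches "position 18" in the traceback.
print(error.start)
```

The offset lines up with the 0xe6 byte reported in the question's error message, which suggests it is the already-interpolated Foo() bytes that trip the ASCII codec, not u'asdf' itself.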

You should always explicitly encode Unicode values to byte strings, or decode byte strings to Unicode, before mixing types, as the corner cases are complex.

Explicitly converting to unicode() works:

>>> print "this should break %s %s" % (unicode(Foo()), u'asdf')
this should break 林覺民謝冰心故居 asdf

as the output is turned into a unicode string:

>>> "this should break %s %s" % (unicode(Foo()), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

while otherwise you'd end up with a byte string:

>>> "this should break %s %s" % (Foo(), 'asdf')
'this should break \xe6\x9e\x97\xe8\xa6\xba\xe6\xb0\x91\xe8\xac\x9d\xe5\x86\xb0\xe5\xbf\x83\xe6\x95\x85\xe5\xb1\x85 asdf'

(note that 'asdf' is left as a byte string too).

Alternatively, use a unicode template:

>>> u"this should break %s %s" % (Foo(), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

edited Mar 20, 2014 at 15:48

answered Mar 20, 2014 at 15:35

Martijn Pieters

6 Comments

ipartola Over a year ago

That explains most of it. I know the difference between "" and u"" so I see that there are many good ways to fix this. Does the fact that the .format() method works come from it just being a different piece of code that handles this more gracefully? I guess my main question is whether the code as I wrote it should work and this is a bug in Python, or whether this is the correct behavior.

Martijn Pieters Over a year ago

@ipartola: str.format() is a different piece of code that handles coercion differently (it will never coerce the template itself, only ever the arguments to the str.format() method).

Martijn Pieters Over a year ago

@ipartola: The automatic implicit coercion was done away with in Python 3 as it was too confusing. That leads to other confusions (some people don't like it that Python 3 won't compare str and bytes values anymore) but I feel it's definitely for the best.
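As a rough illustration of that Python 3 behaviour, the sketch below mixes a bytes value and a str value. It is meant for Python 3 only (under Python 2 the same operations are coerced implicitly), and the variable names are made up for the example.

```python
# Python 3 only: mixing bytes and str raises instead of coercing.
data = b'asdf'
text = 'asdf'

try:
    data + text                 # no implicit decode of the bytes value
except TypeError:
    concat_failed = True
else:
    concat_failed = False

equal = (data == text)          # equality is allowed, but never true

try:
    data < text                 # ordering across the two types is an error
except TypeError:
    ordering_failed = True
else:
    ordering_failed = False

# Converting explicitly first is the supported route.
explicit = data.decode('ascii') + text

print(concat_failed, equal, ordering_failed, explicit)
```

Concatenation and ordering fail loudly, while an equality test simply returns False, so the conversion has to be spelled out before the values are combined.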

ipartola Over a year ago

Python 3 definitely does this better. So is this a bug in Python 2 that I should be reporting or is this just how it's supposed to work? Is there some place that documents this behavior?

Martijn Pieters Over a year ago

It is not a bug; the behaviour was intentional, chosen before it became clear how many problems it led to. The coercion rules are also underdocumented, exacerbating the problem. The specific behaviour here is partly documented, but your use of a custom object confused matters a bit; using a UTF-8 encoded byte string instead would have led to the same error. Bottom line: do as the Unicode HOWTO tells you and avoid mixing byte strings and Unicode.
