I have the following piece of code. The last line throws an error. Why is that?

class Foo(object):

    def __unicode__(self):
        return u'\u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45'

    def __str__(self):
        return self.__unicode__().encode('utf-8')

print "this works %s" % (u'asdf')
print "this works %s" % (Foo(),)
print "this works %s %s" % (Foo(), 'asdf')
print

print "this also works {0} {1}".format(Foo(), u'asdf')
print
print "this should break %s %s" % (Foo(), u'asdf')

The error is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 18: ordinal not in range(128)"

1 Answer

When you mix unicode and byte string objects, Python 2 implicitly tries to convert one to the other: it either encodes the unicode values to byte strings or decodes the byte strings to unicode, using the default ASCII codec in both cases.

You are mixing unicode, byte strings and a custom object, and that triggers a sequence of implicit encodes and decodes that cannot all succeed.

In this case, your Foo() value is interpolated as a byte string (str(Foo()) is used). The u'asdf' interpolation then triggers a decode of the result built so far, which already contains the UTF-8 bytes from Foo(), so that the unicode string can be interpolated. This decode fails because the ASCII codec cannot decode the \xe6\x9e\x97 UTF-8 byte sequence that was already interpolated.
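As a rough illustration of that failing step (a sketch of the effect, not of the interpreter's actual internals), you can perform the same ASCII decode by hand. The names partial and error are made up for this example, and only the first two characters of the Foo() value are used to keep it short:

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for what has been interpolated so far:
# the template prefix followed by the UTF-8 bytes that str(Foo()) produces.
partial = b"this should break " + u"\u6797\u89ba".encode("utf-8")

error = None
try:
    # Python 2 does this decode implicitly once it meets the unicode argument.
    partial.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

# Same message as in the question: byte 0xe6 sits right after the
# 18-character prefix "this should break ".
print(error)
```

The position reported in the error is simply the length of the text that precedes the first non-ASCII byte.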

You should always explicitly encode Unicode values to byte strings, or decode byte strings to Unicode, before mixing types, as the corner cases are complex.
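A minimal sketch of that advice, assuming the bytes you receive are UTF-8 (the names raw, text, message and encoded are only illustrative): decode once when the data comes in, work with unicode throughout, and encode once when the data goes out.

```python
# -*- coding: utf-8 -*-
# Assumed input: a UTF-8 encoded byte string, like the one str(Foo()) returns.
raw = u"\u6797\u89ba\u6c11".encode("utf-8")

# Decode explicitly, naming the codec, instead of relying on implicit ASCII.
text = raw.decode("utf-8")

# Template and arguments are now all unicode, so nothing is coerced behind your back.
message = u"this works %s %s" % (text, u"asdf")

# Encode explicitly only at the output boundary (file, socket, terminal).
encoded = message.encode("utf-8")
```

With every value on the same side of the bytes/unicode divide, the result no longer depends on the order in which the arguments are interpolated.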

Explicitly converting to unicode() works:

>>> print "this should break %s %s" % (unicode(Foo()), u'asdf')
this should break 林覺民謝冰心故居 asdf

as the output is turned into a unicode string:

>>> "this should break %s %s" % (unicode(Foo()), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

while otherwise you'd end up with a byte string:

>>> "this should break %s %s" % (Foo(), 'asdf')
'this should break \xe6\x9e\x97\xe8\xa6\xba\xe6\xb0\x91\xe8\xac\x9d\xe5\x86\xb0\xe5\xbf\x83\xe6\x95\x85\xe5\xb1\x85 asdf'

(note that 'asdf' is left as a byte string too).

Alternatively, use a unicode template:

>>> u"this should break %s %s" % (Foo(), u'asdf')
u'this should break \u6797\u89ba\u6c11\u8b1d\u51b0\u5fc3\u6545\u5c45 asdf'

6 Comments

That explains most of it. I know the difference between "" and u"" so I see that there are many good ways to fix this. Does the fact that the .format() method works come from it just being a different piece of code that handles this more gracefully? I guess my main question is whether the code as I wrote it should work and this is a bug in Python, or is this the correct behavior?
@ipartola: str.format() is a different piece of code that handles coercion differently (it will never coerce the template itself, only ever the arguments to the str.format() method).
@ipartola: The automatic implicit coercion was done away with in Python 3 as it was too confusing. That leads to other confusions (some people don't like it that Python 3 won't compare str and bytes values anymore) but I feel it's definitely for the best.
Python3 definitely does this better. So is this a bug in Python 2 that I should be reporting or is this just how it's supposed to work? Is there some place that documents this behavior?
It is not a bug; the behaviour was intentional before it became clear how many problems it led to. The coercion rules are also underdocumented, exacerbating the problem. The specific behaviour here is partly documented, but your use of a custom object confused matters a bit; using a UTF-8 encoded byte string instead would have led to the same error. Bottom line: do as the Unicode HOWTO tells you and avoid mixing byte strings and Unicode.