How does % string formatting in Python work with Unicode strings?

Question

I have a question about Unicode strings and % string formatting in Python. I have the following four cases:

Case 1:

# -*- encoding: utf -*-
print '%s' % 'München'

Case 2:

# -*- encoding: utf -*-
print '%s' % u'München'

Case 3:

# -*- encoding: utf -*-
print u'%s' % u'München'

Case 4:

# -*- encoding: utf -*-
print u'%s' % 'München'

Cases 1-3 work fine but in case 4 I get the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

So my questions are: why do cases 1-3 work (especially case 2), and why does case 4 fail?

I know how to fix my problem but I want to understand why this problem happens, so I would be happy if someone could help me. Thanks!

PS: Thanks for the links to possible duplicates, but sadly my problem isn't solved by "Why does Python 2.x throw an exception with string formatting + unicode?", because there they don't use a unicode string as the format string. So they cover cases 1 and 2 but not 4, and notably case 2 works for me but breaks for them...

Possible duplicate of Why does Python 2.x throw an exception with string formatting + unicode?
– Sayse, Commented Feb 14, 2017 at 7:24

Mark Tolonen · Accepted Answer · 2017-02-14 16:38:22Z


In cases 2 and 4, the non-Unicode string is being coerced to Unicode implicitly using the default ascii codec. In case 2 '%s' can be converted to Unicode with that codec, but in case 4 'München' cannot.

In cases 1 and 3, both are byte strings or both are Unicode strings so no coercion is required.
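To make the implicit step visible, here is a minimal sketch that writes the coercion out by hand. The helpers `coerce_to_unicode` and `format_mixed` are hypothetical names for illustration only, not part of Python; they only approximate what Python 2 does internally when a byte string and a Unicode string meet in a `%` operation. The decode is explicit, so the sketch should behave the same on Python 2.7 and Python 3.5+.

```python
# -*- coding: utf-8 -*-

def coerce_to_unicode(value):
    """Hypothetical helper: decode byte strings with the ascii codec,
    the way Python 2 does implicitly; pass Unicode strings through."""
    if isinstance(value, bytes):
        return value.decode('ascii')
    return value


def format_mixed(fmt, arg):
    """Hypothetical helper: approximate Python 2's '%' on str/unicode mixes."""
    if isinstance(fmt, bytes) and isinstance(arg, bytes):
        # Case 1: both are byte strings, so nothing is decoded.
        return fmt % arg
    # Cases 2-4: at least one side is Unicode, so any byte string
    # is decoded with the ascii codec first.
    return coerce_to_unicode(fmt) % coerce_to_unicode(arg)


city_unicode = u'M\xfcnchen'                 # u'München'
city_bytes = city_unicode.encode('utf-8')    # b'M\xc3\xbcnchen'

# Case 2: the byte string b'%s' is pure ASCII, so decoding it succeeds.
case2 = format_mixed(b'%s', city_unicode)

# Case 4: the byte string contains 0xc3, which the ascii codec rejects.
try:
    format_mixed(u'%s', city_bytes)
    case4_error = None
except UnicodeDecodeError as exc:
    case4_error = exc

print(case2 == city_unicode)
print(case4_error)
```

The exception caught for case 4 carries the same message as the one in the question: the ascii codec fails on byte 0xc3 at position 1.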


4 Comments

user7561458 Over a year ago

Thanks for your answer, but in cases 2 and 4, why can '%s' be converted to Unicode with that codec and why can't 'München'? Could you explain what is going on in the background?

Mark Tolonen Over a year ago

@R.Kayze % and s are ASCII characters, ü is not. If a character is not a member of the character set being decoded, it can't be converted to Unicode.
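A quick way to see this at the byte level, as a sketch: the ascii codec only accepts byte values below 128. The helper `is_ascii` is a hypothetical name, and the byte string assumes the source file was saved as UTF-8, as in the question.

```python
def is_ascii(data):
    """Hypothetical helper: True if every byte value is below 128."""
    return all(byte < 128 for byte in bytearray(data))


fmt_bytes = b'%s'
city_bytes = b'M\xc3\xbcnchen'   # 'München' encoded as UTF-8

# Collect the positions and values of the bytes the ascii codec would reject.
offending = [(index, hex(byte))
             for index, byte in enumerate(bytearray(city_bytes))
             if byte >= 128]

print(is_ascii(fmt_bytes))    # True
print(is_ascii(city_bytes))   # False
print(offending)              # [(1, '0xc3'), (2, '0xbc')]
```

The first offending byte, 0xc3 at position 1, is the one named in the error message.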

user7561458 Over a year ago

Thanks again, I think I understand cases 2-4. But why does the umlaut 'ü' work in case 1? I am working with Python 2.7, and if s is an ASCII character and ü is not, then s needs just one byte and ü two.

Mark Tolonen Over a year ago

@R.Kayze In case 1, the strings are just bytes. There is no coercion to Unicode for either string so no possibility of UnicodeDecodeError. The bytes are just inserted into the format string as is, whatever they are.
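As a small sketch of case 1, assuming the source file is saved as UTF-8: the two bytes of ü are copied through untouched and no codec is involved. Explicit `b''` literals are used so the snippet should also run on Python 3.5+, where `%` is supported for bytes.

```python
city_bytes = b'M\xc3\xbcnchen'    # 'München' as UTF-8 bytes

# Bytes formatted into bytes: no decoding step, so no UnicodeDecodeError.
result = b'%s' % city_bytes

print(result == city_bytes)               # True
print(len(result))                        # 8: seven characters, ü takes two bytes
print(len(result.decode('utf-8')))        # 7 once decoded with the right codec
```

Whether the output then displays as "München" depends on the terminal interpreting those bytes as UTF-8.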
