I have a question about Unicode and the % string formatting operator in Python 2.7. I have the following four cases:

  1. case:

    # -*- encoding: utf -*-
    print '%s' % 'München'
    
  2. case:

    # -*- encoding: utf -*-
    print '%s' % u'München'
    
  3. case:

    # -*- encoding: utf -*-
    print u'%s' % u'München'
    
  4. case:

    # -*- encoding: utf -*-
    print u'%s' % 'München'
    

Cases 1-3 work fine but in case 4 I get the error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

So my questions are: why do cases 1-3 work (especially case 2), and why does case 4 fail?

I know how to fix my problem but I want to understand why this problem happens, so I would be happy if someone could help me. Thanks!

PS: Thanks for the links to possible duplicates, but sadly my problem isn't solved by "Why does Python 2.x throw an exception with string formatting + unicode?", because there the format string is not a Unicode string. That question covers cases 1 and 2 but not case 4, and case 2 in particular works for me but breaks for them.

1 Answer

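In cases 2 and 4, a byte string and a Unicode string are mixed, so the byte string is implicitly coerced to Unicode using the default ascii codec. In case 2 the byte string is '%s', which that codec can decode; in case 4 it is 'München', which it cannot, because the UTF-8 bytes of 'ü' (starting with 0xc3) fall outside the ASCII range.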

In cases 1 and 3, both are byte strings or both are Unicode strings so no coercion is required.
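To make the implicit coercion visible, here is a minimal sketch that performs the same ASCII decode explicitly. It assumes the source file is saved as UTF-8, so 'München' is stored as the bytes `M\xc3\xbcnchen`; the variable names are illustrative only. The explicit `.decode('ascii')` calls stand in for what Python 2 does behind the scenes, so the snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: spell out the ASCII decode that Python 2 applies implicitly
# when a byte string meets a Unicode string in a % operation.

fmt_bytes = b'%s'                 # byte-string format, as in case 2
city_bytes = b'M\xc3\xbcnchen'    # 'München' as UTF-8 bytes, as in case 4
city_unicode = u'M\xfcnchen'      # the same text as a Unicode string

# Case 2: the format string is pure ASCII, so decoding it succeeds
# and the formatting then happens entirely in Unicode.
case2 = fmt_bytes.decode('ascii') % city_unicode

# Case 4: the argument contains byte 0xc3, which is not ASCII,
# so the decode step raises UnicodeDecodeError.
try:
    city_bytes.decode('ascii')
    case4_error = None
except UnicodeDecodeError as exc:
    case4_error = exc

print(case4_error)
```

The printed message should match the error from the question, including the position of the offending byte (index 1, right after the 'M').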

4 Comments

Thanks for your answer, but in cases 2 and 4, why can '%s' be converted to Unicode with that codec while 'München' cannot? Could you explain what is going on in the background?
@R.Kayze % and s are ASCII characters, ü is not. If a character is not a member of the character set being decoded, it can't be converted to Unicode.
Thanks again, I think I understood cases 2-4. But why does the umlaut 'ü' work in case 1? I am working with Python 2.7, and if s is an ASCII character and ü is not, then s needs just one byte and ü two.
@R.Kayze In case 1, the strings are just bytes. There is no coercion to Unicode for either string so no possibility of UnicodeDecodeError. The bytes are just inserted into the format string as is, whatever they are.
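The byte-level view of case 1 can be sketched as follows, again assuming the file is saved as UTF-8. Formatting bytes into bytes with `%` works in Python 2 and, as far as I know, in Python 3.5 and later; the names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: in case 1 both operands are byte strings, so no decoding happens.

city_bytes = b'M\xc3\xbcnchen'        # 'München' as UTF-8 bytes

# The bytes are inserted into the format string unchanged.
result = b'%s' % city_bytes

# 'ü' occupies two bytes in UTF-8, so there is one more byte than characters.
byte_count = len(city_bytes)
char_count = len(city_bytes.decode('utf-8'))

print(byte_count)   # 8
print(char_count)   # 7
```

Whether the printed bytes then show up as 'München' depends on the terminal interpreting them as UTF-8; Python itself never looks at their meaning in this case.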
