# -*- coding: utf-8 -*-

a = 'éáűőúöüó€'
print type(a)    # <type 'str'>
print a          # éáűőúöüó€
print ord(a[-1]) # 172

Why does this work? Shouldn't this be a SyntaxError: Non-ASCII character '\xc3' in file ...? There are unicode literals in the string.

When I prefix it with u, the results are different:

# -*- coding: utf-8 -*-

a = u'éáűőúöüó€'
print type(a)    # <type 'unicode'>
print a          # éáűőúöüó€
print ord(a[-1]) # 8364

Why? What is the difference between the internal representations in Python? How can I see it myself? :)

  • Why should it be a syntax error to have bytes in a byte string? Commented Feb 12, 2013 at 18:15
  • The first is a str object containing the UTF-8 bytes that are in the file. The second is a unicode object formed by decoding the UTF-8. Use repr() to see the difference. Commented Feb 12, 2013 at 18:19
  • Check the length of the string in the first case. Commented Feb 12, 2013 at 18:20
  • Why the downvotes? This seems like a legitimate question. To ask any clearer would require knowledge of the answer. Commented Feb 12, 2013 at 18:22
  • FYI, this is fixed in Python 3. Commented Feb 12, 2013 at 18:49

1 Answer


There are unicode literals in the string

No, there are not. There are bytes in the string. Python simply goes with the bytes your editor saved to disk when you created the file.

When you prefixed the string with u, you signalled to Python that you are creating a unicode object instead. Python now pays attention to the encoding you declared at the top of your source file, and it decodes the bytes of the literal into a unicode object using that encoding.
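To see the two representations side by side, here is a minimal sketch. It is written so that it should run on both Python 2 and Python 3, and the names text and raw are only illustrative. It starts from the unicode object and encodes it, which should give the same bytes the editor wrote to disk, assuming the file really was saved as UTF-8.

```python
# -*- coding: utf-8 -*-

text = u'éáűőúöüó€'          # a unicode object: one item per character
raw = text.encode('utf-8')    # a byte string: what a UTF-8 file stores on disk

print(len(text))  # 9 characters
print(len(raw))   # 19 bytes: eight 2-byte characters plus the 3-byte euro sign

# Decoding the bytes with the same encoding gives back the original text.
print(raw.decode('utf-8') == text)  # True
```

Calling repr() on each object also shows the difference: the byte string displays its individual bytes, while the unicode object displays code points.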

In both cases, your editor saved a series of bytes to a file. For the € character, the UTF-8 encoding is three bytes, represented in hexadecimal as E2 82 AC. The last byte in the bytestring is thus AC, or 172 in decimal. Once you decode the last 3 bytes as UTF-8, they together become the Unicode code point U+20AC, which is 8364 in decimal.
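The arithmetic can be checked by hand. The sketch below assumes the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx). The names euro_bytes and code_point are only illustrative, and bytearray is used so that indexing yields integers on both Python 2 and Python 3.

```python
# The euro sign, written as an escape so the example does not depend on
# how this file itself is encoded.
euro_bytes = bytearray(u'\u20ac'.encode('utf-8'))

print([hex(b) for b in euro_bytes])  # ['0xe2', '0x82', '0xac']
print(euro_bytes[-1])                # 172, the last byte (0xAC)

# Strip the UTF-8 marker bits and join the remaining payload bits:
# 4 bits from the first byte, 6 bits from each continuation byte.
code_point = (((euro_bytes[0] & 0x0F) << 12)
              | ((euro_bytes[1] & 0x3F) << 6)
              | (euro_bytes[2] & 0x3F))

print(code_point)                    # 8364, i.e. 0x20AC
print(code_point == ord(u'\u20ac'))  # True
```

This matches the two outputs in the question: ord() on the last item of the byte string sees only the final byte, while ord() on the last item of the unicode object sees the whole decoded character.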

You really should read up on Python and Unicode:


1 Comment

Wow, great answer, thanks! I had already read the first link but still didn't understand the difference. Now it's crystal clear! :)
