
What I already know:

b'\xce\xb8'.decode('UTF-8') gives 'θ', because decode() is designed to do exactly that: decode the bytes.

What I want to know is whether the Python 3 shell has some default configuration that controls the following behavior:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> b'\xce\xb8'.decode()
'θ'
>>> b'\xce\xb8'
b'\xce\xb8'
>>> b'\x41'
b'A'
>>> print(b'\xce\xb6')
b'\xce\xb6'
>>> print(b'\xce\xb6'.decode('utf8'))
ζ

It seems that the shell uses ASCII as its default encoding rather than UTF-8.

Is this true? If so, where is that configuration located?

2 Answers


This has nothing to do with encoding. The shell is just echoing the value you gave it in literal form, that is, its repr(). Try this instead:

a = b'\xce\xb8'
print(a)

Result (in Python 2 on a UTF-8 terminal; Python 3 prints b'\xce\xb8' here):

θ

So 'a' does hold the UTF-8 encoding of θ, just as you expected. You're just misinterpreting what Python echoes back to the console.
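For Python 3 specifically, here is a minimal sketch of the difference between what the shell echoes and what print() writes. The variable names are only for illustration. The interactive prompt displays repr(value), while print() uses str(value); for a bytes object both give the same b'...' form, and only decoding produces text.

```python
# Python 3 sketch: the >>> prompt echoes repr(value); print() uses str(value).
a = b'\xce\xb8'              # two raw bytes (the UTF-8 encoding of the letter theta)

shell_echo = repr(a)         # what the interactive prompt displays
printed = str(a)             # what print(a) writes for a bytes object

print(shell_echo)            # b'\xce\xb8'
print(printed)               # b'\xce\xb8' (same form: bytes are not decoded)

text = a.decode('utf-8')     # explicit decoding turns the bytes into a str
print(len(a), len(text))     # 2 1  (two bytes, one character)
```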

BTW, I think the 'b' prefix may not be doing what you expect, and that depends on the Python version. In Python 2.X the 'b' prefix is ignored and doesn't show up in the echoed result. In Python 3.X it creates a bytes object, and the prefix is echoed back. Your output shows the prefix, so you're on Python 3. See here:

Python 2.x:

>>> b'\xce\xb8'
'\xce\xb8'

Python 3.X:

>>> b'\xce\xb8'
b'\xce\xb8'

So in Python 2.X, you get the same result with and without the 'b'. In Python 3.X, the two forms behave differently, and neither matches Python 2.X. I haven't done much with Python 3.X, but I believe this is because the way strings are represented changed in 3.X.

PS: If you really just care how Python is echoing strings back to you, I don't know that there's a way to change that. I wonder, however, why that matters to you.


4 Comments

When I try your first code snippet (to print(a)) in Python 3.7, I get b'\xce\xb8'.
Ha! You're right! I must have made a mistake when I thought I'd tried that. That actually makes more sense to me. I wondered for a moment why Python would convert that back to a character when printing it, but I figured that's just how it worked. All the more reason to point out to the OP that he's using the wrong version of Python for using 'b'. - I'll update my answer. Thanks!
In Python 2, byte strings are the default, so b'' == '' (b is optional). In Python 3, Unicode strings are the default so u'' == '' (u is optional).
Ha. Thanks Mark! Is there any way to get the default Python 2 behavior in Python 3? Since using 'b' doesn't give the same result for P2 as for P3, that must not be it.

Python 3 displays each byte of a bytes object as the equivalent ASCII character if its value is within the printable ASCII range; otherwise it displays an escape sequence, usually the escaped hex value.
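As a rough illustration of that display rule, the sketch below prints the repr() of a few hand-picked byte values. The sample list is arbitrary and chosen only to cover printable ASCII, control characters, and values above 127.

```python
# Python 3 sketch: how repr() renders individual byte values.
samples = [
    b'\x41',      # printable ASCII
    b'\x7e',      # printable ASCII
    b'\x0a',      # newline: shown with its short escape
    b'\x00',      # control character: shown as a hex escape
    b'\x7f',      # DEL: not printable, shown as a hex escape
    b'\xce\xb8',  # values above 127: shown as hex escapes
]

shown = [repr(raw) for raw in samples]

for raw, text in zip(samples, shown):
    # list(raw) gives the integer value of each byte
    print(list(raw), '->', text)
```

The first two come out as b'A' and b'~'; everything else stays escaped, whatever encoding the bytes might have come from.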

From the docs for the bytes type:

Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.

This is a deliberate design decision (from the same doc):

to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data

The interpreter doesn't display characters for bytes outside the ASCII range because it cannot know whether the bytes are encoded as UTF-8, some other encoding, or even if they represent text data at all.
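To make that ambiguity concrete, here is a small sketch that reads the same two bytes in three ways. The choice of latin-1 and of a big-endian integer is arbitrary, picked only to show that several interpretations are equally valid as far as the interpreter can tell. The output uses ascii() so that it does not depend on the terminal's encoding.

```python
# Python 3 sketch: the same bytes mean different things under different readings.
raw = b'\xce\xb8'

as_utf8 = raw.decode('utf-8')         # one character: Greek small letter theta
as_latin1 = raw.decode('latin-1')     # two characters: U+00CE and U+00B8
as_int = int.from_bytes(raw, 'big')   # not text at all: a 16-bit integer

print(len(as_utf8), ascii(as_utf8))       # 1 '\u03b8'
print(len(as_latin1), ascii(as_latin1))   # 2 '\xce\xb8'
print(as_int)                             # 52920
```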

As user Steve points out in their answer, this behaviour is not related to encoding. It is not configurable; if you want to see the characters corresponding to a UTF-8 encoded bytestring, decode to str.
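A minimal sketch of that last step, assuming the bytes really are UTF-8 text. The truncated value `bad` is a made-up example showing what happens when they are not: strict decoding raises an error, and errors='replace' substitutes U+FFFD instead.

```python
# Python 3 sketch: decode explicitly when you want to see characters.
raw = b'\xce\xb6'
text = raw.decode('utf-8')       # str containing Greek small letter zeta
print(ascii(text))               # '\u03b6'

bad = b'\xce'                    # hypothetical truncated sequence, not valid UTF-8
try:
    bad.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True         # strict decoding refuses to guess

lenient = bad.decode('utf-8', errors='replace')   # yields U+FFFD
print(strict_failed, ascii(lenient))              # True '\ufffd'
```

Calling print(text) instead of print(ascii(text)) shows the character itself, provided the terminal can encode it.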
