Why does the Python 3 shell display a character for some bytes but not for others?

Question

What I already know:

b'\xce\xb8'.decode('UTF-8') gives 'θ', because the decode() method is designed for exactly this job: decoding bytes.

What I want to know is: does the Python 3 shell have some default configuration that controls the following behavior?

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> b'\xce\xb8'.decode()
'θ'
>>> b'\xce\xb8'
b'\xce\xb8'
>>> b'\x41'
b'A'
>>> print(b'\xce\xb6')
b'\xce\xb6'
>>> print(b'\xce\xb6'.decode('utf8'))
ζ

It seems like the shell uses ASCII as its default encoding rather than UTF-8.

The question is: is this true? If so, where is the configuration located?

CryptoFool · Accepted Answer · 2019-03-10 05:30:20Z

This has nothing to do with the encoding. Python is just showing you in the shell what the value is that you just gave it, in a more literal sense. Try this instead:

a = b'\xce\xb8'
print(a)

Result (Python 2, on a UTF-8 terminal):

θ

So 'a' is indeed encoded as UTF-8, just as you expected. You're just misinterpreting what Python is echoing back to the console.

BTW, I also think the 'b' prefix isn't doing what you think it is. It appears you're using Python 2.X. In that version of Python, the 'b' prefix is ignored. I know that because it doesn't show up in the echoed result. See here:

Python 2.x:

>>> b'\xce\xb8'
'\xce\xb8'

Python 3.X

>>> b'\xce\xb8'
b'\xce\xb8'

So in Python 2.X, you'll get the same result with and without the 'b'. In Python 3.X, you get different behavior from Python 2.X either way. I haven't done much with Python 3.X, but I believe this is because the way strings are represented changed in 3.X.
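A minimal sketch of that change, assuming Python 3: str holds Unicode text and bytes holds raw byte values, and the two are distinct types. The variable names here are only for illustration.

```python
# Python 3 only: text and binary data are separate types.
text = 'θ'                      # str: a sequence of Unicode code points
data = text.encode('utf-8')     # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))   # one character
print(type(data).__name__, len(data))   # two bytes

# Indexing bytes yields integers, not one-character strings.
first_byte = data[0]
print(hex(first_byte))

# Decoding with the matching encoding gets the original text back.
roundtrip = data.decode('utf-8')
print(roundtrip == text)
```

Converting between the two always goes through an explicit encode() or decode() call, which is why the echoed bytes value keeps its b prefix.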

PS: If you really just care how Python is echoing strings back to you, I don't know that there's a way to change that. I wonder, however, why that matters to you.

edited Mar 10, 2019 at 5:30

answered Mar 10, 2019 at 4:59

CryptoFool


4 Comments

combinatorist Over a year ago

When I try your first code snippet (to print(a)) in Python 3.7, I get b'\xce\xb8'.

CryptoFool Over a year ago

Ha! You're right! I must have made a mistake when I thought I'd tried that. That actually makes more sense to me. I wondered for a moment why Python would convert that back to a character when printing it, but I figured that's just how it worked. All the more reason to point out to the OP that he's using the wrong version of Python for using 'b'. - I'll update my answer. Thanks!

Mark Tolonen Over a year ago

In Python 2, byte strings are the default, so b'' == '' (b is optional). In Python 3, Unicode strings are the default so u'' == '' (u is optional).

CryptoFool Over a year ago

Ha. Thanks Mark! Is there any way to get the default Python 2 behavior in Python 3? Since using 'b' doesn't give the same result for P2 as for P3, that must not be it.

snakecharmerb · Accepted Answer · 2019-03-11 07:55:51Z

When displaying a bytes object, Python 3 shows each byte as the equivalent ASCII character if its value is in the printable ASCII range; otherwise it shows an escape sequence, usually the escaped hex value.
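A small sketch of that display rule, assuming Python 3. It builds bytes objects from integer values and inspects their repr(), which is what the shell echoes. The variable names are only for illustration.

```python
# repr() of a bytes object is what the interactive shell echoes back.
printable = bytes([0x41, 0x7E])       # 'A' and '~': printable ASCII
control = bytes([0x00, 0x0A, 0x7F])   # NUL, newline, DEL: ASCII, but not printable
high = bytes([0xCE, 0xB8])            # values above 127 (the UTF-8 encoding of 'θ')

printable_repr = repr(printable)
control_repr = repr(control)
high_repr = repr(high)

print(printable_repr)   # shown as characters
print(control_repr)     # shown as escape sequences
print(high_repr)        # shown as \x escapes
```

Only the printable ASCII bytes appear as characters; control bytes and anything above 127 stay escaped, whatever they might mean in some encoding.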

From the docs for the bytes type:

Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.

This is a deliberate design decision (from the same doc):

to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data

The interpreter doesn't display characters for bytes outside the ASCII range because it cannot know whether the bytes are encoded as UTF-8, some other encoding, or even if they represent text data at all.
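To illustrate why the interpreter cannot guess, here is a sketch (assuming Python 3) that reads the same two bytes in several ways. The encodings chosen are just examples; nothing about the bytes themselves says which one is right.

```python
raw = b'\xce\xb8'

as_utf8 = raw.decode('utf-8')         # one character: Greek theta
as_latin1 = raw.decode('latin-1')     # two characters: one per byte
as_utf16 = raw.decode('utf-16-le')    # one character: a Hangul syllable
as_int = int.from_bytes(raw, 'big')   # not text at all: a 16-bit integer

print(as_utf8, as_latin1, as_utf16, as_int)
```

Every one of these interpretations is valid, so showing 'θ' by default would be a guess.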

As user Steve points out in their answer, this behaviour is not related to encoding. It is not configurable; if you want to see the characters corresponding to a UTF-8 encoded bytestring, decode to str.
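If you find yourself decoding for display often, a tiny helper can do it. This is only a sketch: show_bytes is a hypothetical name, not part of Python, and UTF-8 as the default encoding is an assumption about your data.

```python
def show_bytes(data, encoding='utf-8'):
    """Hypothetical helper: return decoded text, or the repr if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Not valid in the assumed encoding: fall back to the escaped form.
        return repr(data)


print(show_bytes(b'\xce\xb8'))    # valid UTF-8, so the character is shown
print(show_bytes(b'\xff\xfe'))    # not valid UTF-8, so the repr is shown

# Alternative: decode what is valid and keep the rest as escapes.
mixed = b'\xce\xb8\xff'.decode('utf-8', errors='backslashreplace')
print(mixed)
```

The fallback to repr() keeps the output unambiguous when the bytes are not text in the assumed encoding.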
