If I run

print(chr(244).encode())

I get the two-byte result b'\xc3\xb4'. Why is that? I imagine the number 244 can be encoded into one byte!

    You're not encoding the number 244, you're encoding the unicode code point 244. (244).to_bytes(1, 'big') (or 'little' for the second argument, doesn't matter in this case) indeed produces one byte. Commented Jan 16, 2014 at 22:07

3 Answers


When you call str.encode() without arguments, Python 3 uses UTF-8 as the encoding.

Any codepoint outside the range 0-127 is encoded with multiple bytes in the variable-width UTF-8 codec.
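To make the variable width concrete, here is a small sketch that encodes one code point from each of UTF-8's four length ranges. The sample code points are my own picks for illustration, not anything from the question:

```python
# One sample code point per UTF-8 length class (illustrative choices):
# U+0041 'A', U+00F4 (the code point from the question), U+20AC, U+1F600.
samples = [0x41, 0xF4, 0x20AC, 0x1F600]

# Map each code point to the number of bytes its UTF-8 encoding takes.
utf8_lengths = {hex(cp): len(chr(cp).encode("utf-8")) for cp in samples}

print(utf8_lengths)
# {'0x41': 1, '0xf4': 2, '0x20ac': 3, '0x1f600': 4}
```

Only the code points below 128 fit in a single byte; 244 already falls in the two-byte range.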

You'll have to use a different codec to encode that codepoint to one byte. The Latin-1 encoding can manage it just fine, while the EBCDIC 500 codec (codepage 500) can too, but encodes to a different byte:

>>> print(chr(244).encode('utf8'))
b'\xc3\xb4'
>>> print(chr(244).encode('latin1'))
b'\xf4'
>>> print(chr(244).encode('cp500'))
b'\xcb'

But the Latin-1 and EBCDIC 500 codecs can only encode 256 codepoints; UTF-8 can manage all of the Unicode standard.
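A quick way to see that limit is to try the first code point past the single-byte range. This is only a sketch; try_encode is a helper name I made up for the example:

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

latin1_max = try_encode(chr(255), "latin1")   # b'\xff', still fits in one byte
latin1_over = try_encode(chr(256), "latin1")  # None, Latin-1 has no byte for U+0100
utf8_over = try_encode(chr(256), "utf-8")     # b'\xc4\x80', UTF-8 uses two bytes

print(latin1_max, latin1_over, utf8_over)
```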

If you were expecting the number 244 to be interpreted as a byte value instead, then you should not use chr().encode(); chr() produces a unicode value, not a 'byte', and encoding then produces a different result depending on the exact codec. That's because unicode values are text, not bytes.

Pass your number as a list of integers to the bytes() callable instead:

>>> bytes([244])
b'\xf4'

This only happens to fit the Latin-1 codec result, because the first 256 Unicode codepoints map directly to Latin 1 bytes, by design.
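If you want to convince yourself of that mapping, the following sketch checks it for all 256 values, and also shows that it does not hold for UTF-8:

```python
# Every possible byte value, and the first 256 Unicode code points as text.
all_bytes = bytes(range(256))
first_256 = "".join(chr(i) for i in range(256))

# Latin-1 maps byte N to code point N in both directions.
latin1_roundtrip = (
    all_bytes.decode("latin1") == first_256
    and first_256.encode("latin1") == all_bytes
)

# The same coincidence does not hold for UTF-8 once you pass 127.
same_in_utf8 = bytes([244]) == chr(244).encode("utf-8")

print(latin1_roundtrip, same_in_utf8)
# True False
```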

3 Comments

It's worth giving an example that will map code point 244 to a single byte, but not to 244, like EBCDIC-BE: chr(244).encode('cp500') gives you 203.
@abarnert: Thanks, I was looking for a better codepage than the cp125* family.
Remember those commercials where they tell you that in the future, when we all have flying cars and nano surgery, "IBM will be there"? Well, whenever you need an example of a codec that's different from everything you'd ever expect, IBM will be there.

Character #244 is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX which is indeed encoded as 0xc3 0xb4 in UTF-8. If you want to use a single-byte encoding then you need to specify it.
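As a rough illustration, you can look the character up with the standard unicodedata module and rebuild the two bytes by hand. The bit arithmetic below is the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx); the variable names are just for this sketch:

```python
import unicodedata

cp = 244
ch = chr(cp)

name = unicodedata.name(ch)   # 'LATIN SMALL LETTER O WITH CIRCUMFLEX'
code_point = f"U+{cp:04X}"    # 'U+00F4'

# Two-byte UTF-8 form: the upper 5 bits go into the start byte (110xxxxx),
# the lower 6 bits go into the continuation byte (10xxxxxx).
manual = bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

print(name, code_point, manual == ch.encode("utf-8"))
# LATIN SMALL LETTER O WITH CIRCUMFLEX U+00F4 True
```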


I imagine the number 244 can be encoded into one byte!

Sure, if you design an encoding that can only handle 256 code points, all of them can be encoded into one byte.

But if you design an encoding that can handle all of Unicode's 1,114,112 code points, obviously you can't pack all of them into one byte.

If your only goal were to make things as compact as possible, you could use most of the 256 initial byte values for common code points, and only reserve a few as start bytes for less common code points.

However, if you only use the lower 128 values for single-byte characters, there are some big advantages, especially if you design the encoding so that every byte is unambiguously either a 7-bit character, a start byte, or a continuation byte. That makes the algorithm a lot simpler to implement and faster; you can always scan forward or backward to the start of a character; you can search for ASCII text in a string with traditional byte-oriented (strchr) searches; a simple heuristic can detect your encoding very reliably; you can always detect a string truncated at its start or end instead of misinterpreting it; and so on. So, that's exactly what UTF-8 does.
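Here is a minimal sketch of that three-way byte classification and the backward scan it enables. classify and char_start are names invented for this example, and the code assumes its input is already valid UTF-8 (it does no validation):

```python
def classify(byte):
    """Classify one byte of (assumed valid) UTF-8 by its high bits."""
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx: a complete 7-bit character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: never begins a character
    return "start"             # 11xxxxxx: begins a multi-byte character

def char_start(data, index):
    """Scan backward from index to the first byte of the character containing it."""
    while index > 0 and classify(data[index]) == "continuation":
        index -= 1
    return index

data = "c\u00f4t\u00e9".encode("utf-8")  # b'c\xc3\xb4t\xc3\xa9'
kinds = [classify(b) for b in data]

print(kinds)
# ['ascii', 'start', 'continuation', 'ascii', 'start', 'continuation']
print(char_start(data, 2))  # 1: byte 2 is the tail of the two-byte character at 1
print(data.find(b"t"))      # 3: a plain byte search cannot match inside a multi-byte character
```

Because ASCII bytes never appear inside a multi-byte sequence, the byte-oriented find() above is safe without decoding first.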

Wikipedia explains UTF-8 pretty well. Rob Pike, one of the inventors of UTF-8, explains the design history in detail.
