Character encoding with Python 3

Question

If I run

print(chr(244).encode())

I get the two-byte result b'\xc3\xb4'. Why is that? I imagine the number 244 can be encoded into one byte!

You're not encoding the number 244; you're encoding the Unicode code point 244. (244).to_bytes(1, 'big') (or 'little' for the second argument; it doesn't matter for a single byte) does indeed produce one byte.
– user395760, Commented Jan 16, 2014 at 22:07

Martijn Pieters · Accepted Answer · 2014-01-16 22:20:06Z

2

Calling str.encode() without arguments uses UTF-8, which is the default encoding for that method in Python 3 regardless of your locale.

Any codepoint outside the range 0-127 is encoded with multiple bytes in the variable-width UTF-8 codec.
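A quick way to see the variable width is to encode code points on either side of each range boundary and count the bytes. This is only an illustrative sketch; the sample code points and the name utf8_lengths are chosen here and are not part of the original answer.

```python
# Encoded length in UTF-8 for sample code points around the range boundaries.
samples = [0x41, 0x7F, 0x80, 0xF4, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

# Map each code point to the number of bytes its UTF-8 encoding needs.
utf8_lengths = {cp: len(chr(cp).encode('utf-8')) for cp in samples}

for cp, n in utf8_lengths.items():
    print(f"U+{cp:04X}: {n} byte(s)")
```

Code point 244 (U+00F4) lands in the two-byte range, which matches the b'\xc3\xb4' result from the question.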

You'll have to use a different codec to encode that codepoint to one byte. The Latin-1 encoding can manage it just fine, while the EBCDIC 500 codec (codepage 500) can too, but encodes to a different byte:

>>> print(chr(244).encode('utf8'))
b'\xc3\xb4'
>>> print(chr(244).encode('latin1'))
b'\xf4'
>>> print(chr(244).encode('cp500'))
b'\xcb'

But the Latin-1 and EBCDIC 500 codecs can only encode 256 codepoints; UTF-8 can manage all of the Unicode standard.
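To illustrate the trade-off, here is a small sketch that tries a code point far above U+00FF with both codecs. The snowman character is just an example picked for this sketch.

```python
snowman = chr(0x2603)  # U+2603 SNOWMAN, well outside the single-byte range

# A single-byte codec has no byte to assign to this code point.
try:
    snowman.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 handles it, at the cost of using three bytes.
utf8_bytes = snowman.encode('utf-8')

print(latin1_ok)   # False
print(utf8_bytes)  # b'\xe2\x98\x83'
```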

If you were expecting the number 244 to be interpreted as a byte value instead, then you should not use chr().encode(); chr() produces a unicode value, not a 'byte', and encoding then produces a different result depending on the exact codec. That's because unicode values are text, not bytes.

Pass your number as a list of integers to the bytes() callable instead:

>>> bytes([244])
b'\xf4'

This happens to match the Latin-1 codec result only because the first 256 Unicode codepoints map directly to Latin-1 bytes, by design.
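The correspondence can be checked directly, and int.to_bytes() (mentioned in the comment under the question) gives the same single byte without going through text at all. A minimal sketch; the variable names are chosen for illustration.

```python
# Every code point 0..255 encodes in Latin-1 to the byte with the same value.
latin1_is_identity = all(
    chr(i).encode('latin-1') == bytes([i]) for i in range(256)
)

# Treating 244 as a number rather than as text: two equivalent spellings.
from_list = bytes([244])
from_int = (244).to_bytes(1, 'big')

# bytes() rejects values that do not fit in one byte.
try:
    bytes([256])
    fits = True
except ValueError:
    fits = False

print(latin1_is_identity, from_list, from_int, fits)
```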

edited Jan 16, 2014 at 22:20

answered Jan 16, 2014 at 22:04

Martijn Pieters


3 Comments

abarnert Over a year ago

It's worth giving an example that will map code point 244 to a single byte, but not to 244, like EBCDIC-BE: chr(244).encode('cp500') gives you 203.

Martijn Pieters Over a year ago

@abarnert: Thanks, I was looking for a better codepage than the cp125* family.

abarnert Over a year ago

Remember those commercials where they tell you that in the future, when we all have flying cars and nano surgery, "IBM will be there"? Well, whenever you need an example of a codec that's different from everything you'd ever expect, IBM will be there.

Ignacio Vazquez-Abrams · Accepted Answer · 2014-01-16 22:03:22Z

0

Character #244 is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, which is indeed encoded as 0xc3 0xb4 in UTF-8. If you want to use a single-byte encoding, then you need to specify it.
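Both claims can be verified from Python. The sketch below looks up the character name with the standard unicodedata module and then rebuilds the two UTF-8 bytes by hand using the two-byte layout 110xxxxx 10xxxxxx; the manual bit manipulation is an illustration added here, not something the codec requires you to do.

```python
import unicodedata

cp = 244
ch = chr(cp)
name = unicodedata.name(ch)

# Two-byte UTF-8 form: the top 5 bits go into the lead byte (110xxxxx),
# the low 6 bits go into the continuation byte (10xxxxxx).
manual = bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

print(f"U+{cp:04X}", name)
print(manual, ch.encode('utf-8'))
```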

answered Jan 16, 2014 at 22:03

Ignacio Vazquez-Abrams


abarnert · Accepted Answer · 2014-01-16 22:15:17Z

"I imagine the number 244 can be encoded into one byte!"

Sure, if you design an encoding that can only handle 256 code points, all of them can be encoded into one byte.

But if you design an encoding that can handle all of Unicode's roughly 1.1 million code points (U+0000 through U+10FFFF), obviously you can't pack all of them into one byte.

If your only goal were to make things as compact as possible, you could use most of the 256 initial byte values for common code points, and only reserve a few as start bytes for less common code points.

However, if you only use the lower 128 for single-byte values, there are some big advantages, especially if you design it so that every byte is unambiguously either a 7-bit character, a start byte, or a continuation byte. That makes the algorithm a lot simpler to implement and faster; you can always scan forward or backward to the start of a character; you can search for ASCII text in a string with traditional byte-oriented (strchr) searches; a simple heuristic can detect your encoding very reliably; you can always detect a truncated start or end of a string instead of misinterpreting it; and so on. So, that's exactly what UTF-8 does.
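A rough sketch of what that self-synchronizing property looks like in practice, assuming the input is valid UTF-8. The helper names classify and char_start are hypothetical, written for this illustration only.

```python
def classify(byte):
    """Classify one byte of valid UTF-8 by its high bits."""
    if byte < 0x80:
        return 'ascii'          # 0xxxxxxx
    if byte < 0xC0:
        return 'continuation'   # 10xxxxxx
    return 'start'              # 11xxxxxx, lead byte of a multi-byte sequence


def char_start(data, i):
    """Scan backward from index i to the first byte of the character containing it."""
    while i > 0 and classify(data[i]) == 'continuation':
        i -= 1
    return i


data = 'c\u00f4te'.encode('utf-8')       # b'c\xc3\xb4te'
kinds = [classify(b) for b in data]
start_of_o = char_start(data, 2)         # index 2 is the continuation byte 0xb4
ascii_hit = data.find(b'te')             # plain byte search for ASCII text

print(kinds, start_of_o, ascii_hit)
```

Because ASCII bytes never appear inside a multi-byte sequence, the byte-level find() cannot produce a false match in the middle of another character.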

Wikipedia explains UTF-8 pretty well. Rob Pike, one of the inventors of UTF-8, explains the design history in detail.
