In Python 2.7:

In [2]: utf8_str = '\xf0\x9f\x91\x8d'
In [3]: print(utf8_str)
👍
In [4]: unicode_str = utf8_str.decode('utf-8')
In [5]: print(unicode_str)
👍
In [6]: unicode_str
Out[6]: u'\U0001f44d'
In [7]: len(unicode_str)
Out[7]: 2

Since unicode_str contains only a single Unicode code point (U+1F44D), why does len(unicode_str) return 2 instead of 1?

1 Answer

Your Python binary was compiled with UCS-2 support (a narrow build) and internally anything outside of the BMP (Basic Multilingual Plane) is represented using a surrogate pair.

That means such codepoints show up as 2 characters when asking for the length.
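The surrogate pair follows from the standard UTF-16 arithmetic. As a minimal sketch (the helper name to_surrogate_pair is made up for this illustration and is not part of Python's API), the two code units for U+1F44D can be derived like this:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F44D)
print(hex(high), hex(low))  # 0xd83d 0xdc4d
```

These are the same two values a narrow build reports as separate characters.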

If this matters, you'll have to recompile your Python binary to use UCS-4 instead (./configure --enable-unicode=ucs4 will enable it), or upgrade to Python 3.3 or newer. In 3.3, Python's Unicode support was overhauled to use a variable-width Unicode type that switches between ASCII, UCS-2 and UCS-4 as required by the codepoints contained.

On Python versions 2.7 and 3.0 - 3.2, you can detect what kind of build you have by inspecting the sys.maxunicode value; it'll be 2^16-1 == 65535 == 0xFFFF for a narrow UCS-2 build, 1114111 == 0x10FFFF for a wide UCS-4 build. In Python 3.3 and up it is always set to 1114111.
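The check can also live in code rather than in a one-off shell command. This is only a sketch; the function name is_narrow_build is an assumption made for the example:

```python
import sys


def is_narrow_build():
    """Return True on a narrow (UCS-2) build, False otherwise.

    Hypothetical helper; it only compares sys.maxunicode to 0xFFFF.
    """
    return sys.maxunicode == 0xFFFF


print(sys.maxunicode, is_narrow_build())
```

On Python 3.3 and up this always reports False, since sys.maxunicode is fixed at 1114111 there.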

Demo:

# Narrow build
$ bin/python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
65535 2 [u'\ud83d', u'\udc4d']
# Wide build
$ python -c 'import sys; print sys.maxunicode, len(u"\U0001f44d"), list(u"\U0001f44d")'
1114111 1 [u'\U0001f44d']
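If recompiling or upgrading is not an option, one possible workaround is to count code points through a fixed-width encoding instead of calling len() directly. The sketch below assumes the string contains no lone surrogates; code_point_len is a name invented for this example:

```python
def code_point_len(text):
    """Count Unicode code points independently of the build type.

    UTF-32 uses exactly 4 bytes per code point, and the codec joins
    surrogate pairs on narrow builds. Hypothetical helper; assumes
    the input contains no unpaired surrogates.
    """
    return len(text.encode("utf-32-le")) // 4


print(code_point_len(u"\U0001f44d"))  # 1
```

The "-le" variant is used so that no byte order mark is prepended, which would otherwise add 4 bytes to the encoded length.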

7 Comments

You can use sys.maxunicode on Python 3 too. It is implied, but it is worth pointing out explicitly that len(u'\U0001f44d') == 1 on Python 3.3+ (or a wide Python 2 build).
@J.F.Sebastian: sure, but as of 3.3 it is a constant there, as Python 3.3 and up transparently switch between ASCII, UCS-2 and UCS-4 storage for strings as required. And you really don't want to use Python < 3.3 anyway.
There is no narrow/wide distinction on Python 3.3+ (the internal representation is not exposed; you don't care what Python uses internally). The point is that you could use sys.maxunicode regardless of the version.
I never said there was such a distinction.
My system is running Python 3.6 and I double-checked that sys.maxunicode is 1114111, but the length of this emoji/string still displays as 2 :_(