In the Python API, is there a way to extract the Unicode code point of a single character?
Edit: In case it matters, I'm using Python 2.7.
If I understand your question correctly, you can do this.
>>> s='㈲'
>>> s.encode("unicode_escape")
b'\\u3232'
This shows the Unicode escape sequence as it would be written in a source string.
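If what you want is the number itself rather than an escape sequence, a minimal sketch built on ord() might look like this. The helper name format_code_point is hypothetical, used here only for illustration; it should behave the same on Python 2.7 and Python 3 for characters in the Basic Multilingual Plane.

```python
def format_code_point(ch):
    """Return the conventional 'U+XXXX' notation for a single character."""
    # ord() gives the code point as an int; %04X pads it to at least 4 hex digits.
    return "U+%04X" % ord(ch)

for ch in (u"a", u"\u3232", u"\u0107"):
    print(format_code_point(ch))
# U+0061
# U+3232
# U+0107
```

Unlike unicode_escape, this gives a number for ASCII characters too.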
What does the b mean?
'a'.encode('unicode_escape') gives a instead of a '\\u' escape. (Same with u'a'.encode('unicode_escape').) Also, the format is different when you go outside the Basic Multilingual Plane: u'😱'.encode('unicode_escape') gives '\\U0001f631'.
Use "a".encode("unicode_escape").hex() to get the hexadecimal representation as a str. Alternatively, hex(ord("a")) will also work.
I'm not saying unicode_escape does the wrong thing: I agree that unicode_escape should return a for a. I was pointing out that for the question asked, namely how to answer "which Unicode codepoint is this" [getting a number], using unicode_escape is not an ideal solution, as it helps [gives the "number"] only for (some) non-ASCII characters. See the other answers for more; it's better to use ord (assuming Python 3.3 or later etc). (That is also the answer the OP accepted, so I think I understood the question as intended, from their comments etc.)
>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
...
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233
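The loop above assumes that é is stored as the single precomposed code point U+00E9. The same visible text can also be stored as e followed by a combining accent, in which case ord() reports two numbers for it. A small sketch of the difference, using the standard unicodedata module (the variable names are only illustrative):

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = u"cafe\u0301"
# NFC normalization merges the pair into the precomposed U+00E9
composed = unicodedata.normalize("NFC", decomposed)

print([ord(c) for c in decomposed])  # [99, 97, 102, 101, 769]
print([ord(c) for c in composed])    # [99, 97, 102, 233]
```

Normalizing to NFC first is one way to get a single integer per accented letter, at least for characters that have a precomposed form.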
Depending on normalization, you may get u'e' 101 and u'\u0301' 769 at the end instead...
When I do ucp = ord(c) then print ucp, I get three integers, not a single integer. How do I get a single integer?
Turns out getting this right is fairly tricky: Python 2 and Python 3 have some subtle issues with extracting Unicode code points from a string.
Up until Python 3.3, it was possible to compile Python in one of two modes:
sys.maxunicode == 0x10FFFF
In this mode, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:
>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']
This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.
sys.maxunicode == 0xFFFF
In this mode, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code points from U+10000 through U+10FFFF are represented using a pair of string elements in the UTF-16 encoding:
>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']
This is the default for Python 2.7 on macOS and Windows.
This runtime difference makes writing Python modules to manipulate Unicode strings as series of codepoints quite inconvenient.
To solve this, I contributed a new module codepoints to PyPI:
https://pypi.python.org/pypi/codepoints/1.0
This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:
>>> import sys
>>> import codepoints
>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'
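For comparison, here is a dependency-free sketch of the same idea. It is not how the codepoints module is implemented; it simply relies on the fact that UTF-32 stores every code point in exactly four bytes. The function name code_points is hypothetical, and I am assuming (not having verified on every build) that the UTF-32 encoder on narrow Python 2 builds combines surrogate pairs as it does elsewhere.

```python
import struct

def code_points(text):
    """Return the code points of a Unicode string as a list of ints."""
    # utf-32-le: 4 bytes per code point, little-endian, no byte order mark
    data = text.encode("utf-32-le")
    # Unpack the bytes as that many unsigned 32-bit little-endian integers
    return list(struct.unpack("<%dI" % (len(data) // 4), data))

print(code_points(u"\U0001F40D"))  # [128013]
print(code_points(u"caf\xe9"))     # [99, 97, 102, 233]
```

On Python 3.3 and later this is equivalent to [ord(c) for c in text], so it is only worth the extra step when narrow builds have to be supported.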
How do I use codepoints.to_unicode(x) on a modified code that has been offset by the appropriate letters of the basic flag?
import codepoints
# does not work
# print(codepoints.to_unicode(tuple(127462)))
# works
print(codepoints.to_unicode((127462,)))
# works ("AU" Australia Flag)
print(codepoints.to_unicode((127462, 127482)))
Usually, you just do ord(character) to find the code point of a character. For completeness though, characters in the Unicode supplementary planes (those above U+FFFF, outside the Basic Multilingual Plane) are represented as surrogate pairs (i.e. two code units) in narrow Python builds, so in that case I often needed to do this small work-around:
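def get_wide_ordinal(char):
    # Anything other than a two-unit string is passed straight to ord().
    if len(char) != 2:
        return ord(char)
    # Combine the high and low surrogates into a single code point.
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)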
This is rare in most applications though, so normally just use ord().
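To make the arithmetic in that work-around concrete, here is a small sketch that works on plain integers, so it runs the same on any build. Both function names are hypothetical; from_surrogate_pair repeats the formula used above, and to_surrogate_pair is its inverse.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogate_pair(high, low):
    """Recombine a surrogate pair into one code point."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F40D)
print("%#x %#x" % (high, low))                # 0xd83d 0xdc0d
print("%#x" % from_surrogate_pair(high, low))  # 0x1f40d
```

The pair 0xd83d, 0xdc0d is the same one a narrow build shows for u'\U0001F40D'.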
D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF. and the low equivalent D73?
ord("\N{HIRAGANA LETTER KU}") is indeed 12367, aka 0x304F. I would never use numbers for characters the way you do, only named ones the way I do. Magic numbers are bad for your program. Just think of chr and ord as inverse functions of each other. It's really easy.
chr is the opposite of ord in Python 3.x, but in Python 2.x unichr is the inverse of ord, as chr only works for ordinals up to 255 in Python 2.x.
chr and ord were always meant to be inverses, and it was a legacy Python 2 bug that they sometimes weren't. That's nuts.
str and unicode.
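A short sketch of the round trip discussed in these comments. The to_char alias is a hypothetical name, introduced only so the same snippet runs on Python 2 (where the inverse of ord for unicode strings is unichr) and Python 3 (where it is chr).

```python
import unicodedata

ku = u"\N{HIRAGANA LETTER KU}"
code_point = ord(ku)

print(hex(code_point))        # 0x304f
print(unicodedata.name(ku))   # HIRAGANA LETTER KU

try:
    to_char = unichr  # Python 2: chr() only handles 0..255
except NameError:
    to_char = chr     # Python 3: chr() covers the full Unicode range

# ord and its inverse round-trip the character
assert to_char(code_point) == ku
```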