98

In Python, is there a way to get the Unicode code point of a single character?

Edit: In case it matters, I'm using Python 2.7.

7
  • 1
    e.g. for '\u304f' I want '304f'. is that what 'ord()' will do? Yes- docs.python.org/library/functions.html#ord Commented Sep 3, 2011 at 4:45
  • 2
    Yes, ord("\N{HIRAGANA LETTER KU}") is indeed 12367, aka 0x304F. I would never use numbers for characters the way you do, only named ones the way I do. Magic numbers are bad for your program. Just think of chr and ord as inverse functions of each other. It’s really easy. Commented Sep 3, 2011 at 4:48
  • @tchrist it might be worth noting chr is the opposite of ord in python 3.x, but in python 2.x unichr is the inverse of ord as chr only works for ordinals up to 255 in python 2.x. Commented Sep 3, 2011 at 5:08
  • @David: Yes, but I consider that a legacy system, which doesn't really work very well for Unicode — as you have yourself just demonstrated. chr and ord were always meant to be inverses, and it was a legacy Python 2 bug that they sometimes weren't. That's nuts. Commented Sep 3, 2011 at 5:09
  • 2
    @tchrist there are still lots of people using python 2.x. Even in python 3.x there are still narrow Unicode builds (for example most Windows builds of python 3.x are narrow.) I wouldn't call most 2.x Unicode issues bugs so much as additions to maintain backwards compatibility with older scripts, python 2.x usually works fine with Unicode. python 3.0 does make things much more consistent though, eliminating the difference between str and unicode. Commented Sep 3, 2011 at 5:27

5 Answers 5

110

If I understand your question correctly, you can do this.

>>> s='㈲'
>>> s.encode("unicode_escape")
b'\\u3232'

This returns the character's Unicode escape sequence, written the way it would appear in source code.
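If only the hex digits are wanted (for example '304f' for '\u304f', as asked in the comments on the question), formatting the result of ord() avoids the differences between ASCII, BMP and non-BMP escapes. This is a minimal sketch assuming Python 3; the helper name code_point_hex is made up for illustration and is not part of any library.

```python
def code_point_hex(ch):
    """Return the code point of a single character as lowercase hex digits."""
    # ord() gives the integer code point; "04x" pads to at least four hex digits.
    return format(ord(ch), "04x")


print(code_point_hex("\u304f"))      # 304f
print(code_point_hex("a"))           # 0061
print(code_point_hex("\U0001f631"))  # 1f631
```

Unlike unicode_escape, this gives the same kind of output for every character, including plain ASCII.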


7 Comments

In case it matters, I'm using Python 2.7.
What does the b mean ?
For me, this doesn't work with ASCII characters: 'a'.encode('unicode_escape') gives a instead of '\\u. (Same with u'a'.encode('unicode_escape').) Also, the format is different when you go outside the Basic Multilingual Plane: u'😱'.encode('unicode_escape') gives '\\U0001f631'.
@ShreevatsaR Try "a".encode("unicode_escape").hex() to get the hexadecimal representation as a str. Alternatively, hex(ord("a")) will also work.
@smido I was not saying that unicode_escape does the wrong thing: I agree that unicode_escape should return a for a. I was pointing out that for the question asked, namely how to answer “which Unicode codepoint is this” [getting a number], using unicode_escape is not an ideal solution as it helps [gives the "number"] only for (some) non-ASCII characters. See the other answers for more — it's better to use ord (assuming Python 3.3 or later etc). (That is also the answer the OP accepted, so I think I understood the question as intended, from their comments etc.)
74
>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
... 
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233

8 Comments

Of course, it might print out u'e' 101 and u'\u0301' 769 at the end instead...
It looks like 'ord()' does what I want: docs.python.org/library/functions.html#ord. Thanks.
If 'c' is my character variable (say it's equal to 'あ'), if I do ucp = ord(c) then print ucp I get three integers, not a single integer. How do I get a single integer?
How did you get あ into the variable? If it's a literal in your source code, then make sure your source file has an appropriate encoding set. Otherwise, ask a new question and post more detailed code.
In case it matters, I'm using Python 2.7.
16

Turns out getting this right is fairly tricky: Python 2 and Python 3 have some subtle issues with extracting Unicode code points from a string.

Up until Python 3.3, it was possible to compile Python in one of two modes:

  1. sys.maxunicode == 0x10FFFF

In this mode, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:

>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']

This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.

  2. sys.maxunicode == 0xFFFF

In this mode, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code points from U+10000 through U+10FFFF are represented using a pair of string elements in the UTF-16 encoding:

>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']

This is the default for Python 2.7 on macOS and Windows.

This runtime difference makes writing Python modules to manipulate Unicode strings as series of codepoints quite inconvenient.
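To make the difference concrete, here is a minimal sketch (assuming Python 3, where strings are always indexed by code point) that reproduces what a narrow build stores and then folds the surrogate pairs back into code points. The function names utf16_code_units and code_points_from_units are hypothetical helpers written for this illustration.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text (what a narrow build stores)."""
    data = text.encode("utf-16-le")
    # Each code unit is one unsigned 16-bit little-endian integer.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_points_from_units(units):
    """Combine surrogate pairs in a list of UTF-16 code units into code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High surrogate followed by low surrogate: one code point.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate, passed through unchanged).
            result.append(unit)
            i += 1
    return result


units = utf16_code_units("a\U0001F40D")
print([hex(u) for u in units])                             # ['0x61', '0xd83d', '0xdc0d']
print([hex(cp) for cp in code_points_from_units(units)])   # ['0x61', '0x1f40d']
```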

The codepoints module

To solve this, I contributed a new module codepoints to PyPI:

https://pypi.python.org/pypi/codepoints/1.0

This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:

>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'

4 Comments

Hello I'm trying to use codepoints with en.wikipedia.org/wiki/Regional_Indicator_Symbol offsets to make flags of various countries in Python. Here is a javascript implementation: github.com/thekelvinliu/country-code-emoji/blob/… How do I use codepoints.to_unicode(x) on a modified codes that has been offset by the appropriate letters of the basic flag?
UPDATE: figured it out, to_unicode needs at least a two-tuple.
@thadk , glad you figured it out—but could you share with me the first code snippet you tried? I'm curious what didn't work.
import codepoints
# does not work
# print(codepoints.to_unicode(tuple(127462)))
# works
print(codepoints.to_unicode((127462,)))
# works ("AU" Australia flag)
print(codepoints.to_unicode((127462, 127482)))
13

Usually, you just call ord(character) to find the code point of a character. For completeness, though: characters outside the Basic Multilingual Plane (U+10000 and above) are represented as surrogate pairs (i.e. two code units) in narrow Python builds, so in that case I often needed this small workaround:

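def get_wide_ordinal(char):
    # Anything that is not a two-element string goes straight to ord().
    if len(char) != 2:
        return ord(char)
    # Combine the high (char[0]) and low (char[1]) surrogates into one code point.
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)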

This is rare in most applications though, so normally just use ord().
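For reference, the arithmetic can be checked by going the other way. The sketch below splits a code point into its surrogate pair and then recombines it with the same formula as above. It assumes Python 3 and works on plain integers, and the names to_surrogate_pair and from_surrogate_pair are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a surrogate pair using the same formula as get_wide_ordinal."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F40D)
print(hex(high), hex(low))                  # 0xd83d 0xdc0d
print(hex(from_surrogate_pair(high, low)))  # 0x1f40d
```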

5 Comments

A surrogate pair is NOT "two characters". It represents ONE character. It consists of two code points. See "code point" and "code point type" in unicode.org/glossary
@JohnMachin: You're close, but not quite: A surrogate pair is still just one code point. It's two code units.
@Thanatos: Have you actually read the link that I provided? Have you followed through to D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF. and the low equivalent D73?
@JohnMachin: It is slightly confusing that the standard uses that terminology. I suppose in some ways, they are code points; code points in those ranges are reserved for surrogate pairs. I think the standard is getting at the fact that the code points are reserved, that is all. Note, "The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character."
My point was that a surrogate pair, once decoded, represents a single code point. There are only two things: the encoded UTF-16 stream of code units, or the decoded code point stream; for surrogate pairs, you'll have 2 in the former and 1 in the latter.
8

python2

>>> print hex(ord(u'人'))
0x4eba

2 Comments

to get int value: int(hex(ord(u'人')),16)
@mhcpan You can just simply do ord(u'人') instead of int(hex(ord(u'人')),16)
