Get unicode code point of a character using Python

Question

Is there a way in Python to extract the Unicode code point of a single character?

Edit: In case it matters, I'm using Python 2.7.

E.g., for '\u304f' I want '304f'. Is that what ord() will do? Yes: docs.python.org/library/functions.html#ord
– Ken, Commented Sep 3, 2011 at 4:45
Yes, ord("\N{HIRAGANA LETTER KU}") is indeed 12367, aka 0x304F. I would never use numbers for characters the way you do, only named ones the way I do. Magic numbers are bad for your program. Just think of chr and ord as inverse functions of each other. It’s really easy. — tchrist
– tchrist, Commented Sep 3, 2011 at 4:48
@tchrist it might be worth noting chr is the opposite of ord in python 3.x, but in python 2.x unichr is the inverse of ord as chr only works for ordinals up to 255 in python 2.x. — cryo
– cryo, Commented Sep 3, 2011 at 5:08
@David: Yes, but I consider that a legacy system, which doesn't really work very well for Unicode — as you have yourself just demonstrated. chr and ord were always meant to be inverses, and it was a legacy Python 2 bug that they sometimes weren't. That's nuts. — tchrist
– tchrist, Commented Sep 3, 2011 at 5:09
@tchrist there are still lots of people using python 2.x. Even in python 3.x there are still narrow Unicode builds (for example most Windows builds of python 3.x are narrow.) I wouldn't call most 2.x Unicode issues bugs so much as additions to maintain backwards compatibility with older scripts, python 2.x usually works fine with Unicode. python 3.0 does make things much more consistent though, eliminating the difference between str and unicode. — cryo
– cryo, Commented Sep 3, 2011 at 5:27
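
A minimal sketch of the round trip discussed in these comments, assuming Python 3 (on Python 2 the inverse of ord for unicode characters is unichr, as noted above). The character is spelled by name so that no bare number appears in the source.

```python
# Python 3 sketch: ord() and chr() are inverses of each other.
ku = "\N{HIRAGANA LETTER KU}"   # the same character as "\u304f"

code_point = ord(ku)            # integer code point of the character
round_trip = chr(code_point)    # back from the integer to the character

print(code_point)
print(round_trip == ku)
```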

Keith · Accepted Answer · 2011-09-03 04:39:09Z

110

If I understand your question correctly, you can do this.

>>> s='㈲'
>>> s.encode("unicode_escape")
b'\\u3232'

Shows the unicode escape code as a source string.
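
For comparison, here is a small sketch (assuming Python 3) that puts the escape-based spelling next to the value from ord(). The helper names escape_form and hex_code_point are made up for this illustration. The escape spelling changes shape with the character (plain ASCII, \u with four digits, \U with eight), while ord() always yields an integer that can be formatted as needed.

```python
# Python 3 sketch comparing unicode_escape with ord().
# escape_form and hex_code_point are hypothetical helper names.

def escape_form(ch):
    """Return the unicode_escape spelling of ch as text."""
    return ch.encode("unicode_escape").decode("ascii")

def hex_code_point(ch):
    """Return the code point of ch as lowercase hex, at least 4 digits."""
    return format(ord(ch), "04x")

for ch in ("a", "\u3232", "\U0001f631"):
    print(escape_form(ch), hex_code_point(ch))
```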

answered Sep 3, 2011 at 4:39


7 Comments

Ken Over a year ago

In case it matters, I'm using Python 2.7.

MK Yung Over a year ago

What does the b mean?

ShreevatsaR Over a year ago

For me, this doesn't work with ASCII characters: 'a'.encode('unicode_escape') gives 'a' instead of a '\\u' escape. (Same with u'a'.encode('unicode_escape').) Also, the format is different when you go outside the Basic Multilingual Plane: u'😱'.encode('unicode_escape') gives '\\U0001f631'.

imrek Over a year ago

@ShreevatsaR Try "a".encode("unicode_escape").hex() to get the hexadecimal representation as a str. Alternatively, hex(ord("a")) will also work.

ShreevatsaR Over a year ago

@smido I was not saying that unicode_escape does the wrong thing: I agree that unicode_escape should return a for a. I was pointing out that for the question asked, namely how to answer “which Unicode codepoint is this” [getting a number], using unicode_escape is not an ideal solution as it helps [gives the "number"] only for (some) non-ASCII characters. See the other answers for more — it's better to use ord (assuming Python 3.3 or later etc). (That is also the answer the OP accepted, so I think I understood the question as intended, from their comments etc.)


Mike Graham · Accepted Answer · 2011-09-03 04:28:20Z

74

>>> ord(u"ć")
263
>>> u"café"[2]
u'f'
>>> u"café"[3]
u'\xe9'
>>> for c in u"café":
...     print repr(c), ord(c)
... 
u'c' 99
u'a' 97
u'f' 102
u'\xe9' 233
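
The code points such a loop reports depend on how the text is stored: an accented letter may be a single precomposed character or a base letter followed by a combining mark. A short sketch, assuming Python 3 and the standard unicodedata module, that shows both forms of the same word:

```python
import unicodedata

word = "caf\u00e9"                                # precomposed e with acute
composed = unicodedata.normalize("NFC", word)     # keeps U+00E9 as one character
decomposed = unicodedata.normalize("NFD", word)   # splits into e + U+0301

print([ord(c) for c in composed])
print([ord(c) for c in decomposed])
```

Normalizing to NFC before calling ord() on each character is one way to get the precomposed code points where they exist.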

answered Sep 3, 2011 at 4:28


8 Comments

Dietrich Epp Over a year ago

Of course, it might print out u'e' 101 and u'\u0301' 769 at the end instead...

Ken Over a year ago

It looks like 'ord()' does what I want: docs.python.org/library/functions.html#ord. Thanks.

Ken Over a year ago

If 'c' is my character variable (say it's equal to 'あ'), if I do ucp = ord(c) then print ucp I get three integers, not a single integer. How do I get a single integer?

Karl Knechtel Over a year ago

How did you get あ into the variable? If it's a literal in your source code, then make sure your source file has an appropriate encoding set. Otherwise, ask a new question and post more detailed code.

Ken Over a year ago

In case it matters, I'm using Python 2.7.


Ben Hamilton · Accepted Answer · 2017-02-16 00:46:14Z

16

Turns out getting this right is fairly tricky: Python 2 and Python 3 have some subtle issues with extracting Unicode code points from a string.

Up until Python 3.3, it was possible to compile Python in one of two modes:

sys.maxunicode == 0x10FFFF

In this mode, Python's Unicode strings support the full range of Unicode code points from U+0000 to U+10FFFF. One code point is represented by one string element:

>>> import sys
>>> hex(sys.maxunicode)
'0x10ffff'
>>> len(u'\U0001F40D')
1
>>> [c for c in u'\U0001F40D']
[u'\U0001f40d']

This is the default for Python 2.7 on Linux, as well as universally on Python 3.3 and later across all operating systems.

sys.maxunicode == 0xFFFF

In this mode, Python's Unicode strings only support the range of Unicode code points from U+0000 to U+FFFF. Any code points from U+10000 through U+10FFFF are represented using a pair of string elements in the UTF-16 encoding:

>>> import sys
>>> hex(sys.maxunicode)
'0xffff'
>>> len(u'\U0001F40D')
2
>>> [c for c in u'\U0001F40D']
[u'\ud83d', u'\udc0d']

This is the default for Python 2.7 on macOS and Windows.

This runtime difference makes writing Python modules to manipulate Unicode strings as series of codepoints quite inconvenient.
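
A minimal sketch of how code could detect which mode it is running in. The function name is_narrow_build is made up for this illustration; only sys.maxunicode comes from the text above.

```python
import sys

def is_narrow_build():
    """Return True when string elements are UTF-16 code units."""
    return sys.maxunicode == 0xFFFF

# On a narrow build the snake emoji occupies two string elements, otherwise one.
snake = u"\U0001F40D"
print(hex(sys.maxunicode), is_narrow_build(), len(snake))
```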

The codepoints module

To solve this, I contributed a new module codepoints to PyPI:

https://pypi.python.org/pypi/codepoints/1.0

This module solves the problem by exposing APIs to convert Unicode strings to and from lists of code points, regardless of the underlying setting for sys.maxunicode:

>>> import sys
>>> import codepoints
>>> hex(sys.maxunicode)
'0xffff'
>>> snake = tuple(codepoints.from_unicode(u'\U0001F40D'))
>>> len(snake)
1
>>> snake[0]
128013
>>> hex(snake[0])
'0x1f40d'
>>> codepoints.to_unicode(snake)
u'\U0001f40d'
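
For readers who want to see the idea without installing anything, here is a rough pure-Python sketch of turning a string into code points while joining surrogate pairs. This is not the codepoints module's implementation; iter_code_points is a hypothetical name, and the example assumes Python 3, where a string literal may contain explicit surrogates.

```python
def iter_code_points(text):
    """Yield integer code points, joining UTF-16 surrogate pairs."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        # Anything else (including an unpaired surrogate) is passed through.
        yield unit
        i += 1

# The pair below is how a narrow build stores U+1F40D.
print([hex(cp) for cp in iter_code_points("\ud83d\udc0d")])
```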

answered Feb 16, 2017 at 0:46


4 Comments

thadk Over a year ago

Hello I'm trying to use codepoints with en.wikipedia.org/wiki/Regional_Indicator_Symbol offsets to make flags of various countries in Python. Here is a javascript implementation: github.com/thekelvinliu/country-code-emoji/blob/… How do I use codepoints.to_unicode(x) on a modified codes that has been offset by the appropriate letters of the basic flag?

thadk Over a year ago

UPDATE: figured it out, to_unicode needs at least a two-tuple.

Ben Hamilton Over a year ago

@thadk , glad you figured it out—but could you share with me the first code snippet you tried? I'm curious what didn't work.

thadk Over a year ago

import codepoints

# does not work
# print(codepoints.to_unicode(tuple(127462)))

# works
print(codepoints.to_unicode((127462,)))

# works ("AU" Australia Flag)
print(codepoints.to_unicode((127462, 127482)))
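
A note on the failing line: tuple(127462) raises a TypeError because tuple() expects an iterable and an int is not one, whereas (127462,) is a one-element tuple. As a sketch of the flag idea on Python 3 without the codepoints module, the built-in chr can build the regional indicator pair directly. The names flag and REGIONAL_INDICATOR_A are made up for this illustration.

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # 127462, the value used in the snippet above

def flag(country_code):
    """Map a two-letter ASCII country code to regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected two ASCII letters")
    # Offset each letter from 'A' into the regional indicator range.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code)

print([ord(ch) for ch in flag("AU")])
```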

Samy Bencherif · Accepted Answer · 2017-08-19 22:21:17Z

13

Usually, you just do ord(character) to find the code point of a character. For completeness though, characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented as surrogate pairs (i.e. two code units) in narrow Python builds, so in that case I often needed this small workaround:

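def get_wide_ordinal(char):
    # Anything that is not a two-element string goes straight to ord().
    if len(char) != 2:
        return ord(char)
    # Combine the high and low surrogates into a single code point.
    return 0x10000 + (ord(char[0]) - 0xD800) * 0x400 + (ord(char[1]) - 0xDC00)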

This is rare in most applications though, so normally just use ord().
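
To make the arithmetic concrete, here is a sketch of the inverse direction: splitting a code point above U+FFFF into its two UTF-16 surrogates, then joining them again with the same formula as above. The function names are made up for this illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

def from_surrogate_pair(high, low):
    """Join two surrogates; same arithmetic as get_wide_ordinal, on integers."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F40D)
print(hex(high), hex(low))             # 0xd83d 0xdc0d
print(hex(from_surrogate_pair(high, low)))
```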

edited Aug 19, 2017 at 22:21 by Samy Bencherif

answered Sep 3, 2011 at 4:55 by cryo

5 Comments

John Machin Over a year ago

A surrogate pair is NOT "two characters". It represents ONE character. It consists of two code points. See "code point" and "code point type" in unicode.org/glossary

Thanatos Over a year ago

@JohnMachin: You're close, but not quite: A surrogate pair is still just one code point. It's two code units.

John Machin Over a year ago

@Thanatos: Have you actually read the link that I provided? Have you followed through to D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF. and the low equivalent D73?

Thanatos Over a year ago

@JohnMachin: It is slightly confusing that the standard uses that terminology. I suppose in some ways, they are code points — code points in those ranges are reserved for surrogate pairs. I think the standard is getting that the code points are reserved, that is all. Note, "The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character."

Thanatos Over a year ago

My point was that a surrogate pair, once decoded, represents a single code point. There are only two things: the encoded UTF-16 stream of code units, or the decoded code point stream; for surrogate pairs, you'll have 2 in the former and 1 in the latter.

lookinghong · Accepted Answer · 2019-07-04 03:43:29Z

8

Python 2:

>>> print hex(ord(u'人'))
0x4eba

edited Jul 4, 2019 at 3:43

answered Jul 4, 2019 at 3:37


2 Comments

mhcpan Over a year ago

to get int value: int(hex(ord(u'人')),16)

wisbucky Mar 17 at 23:14

@mhcpan You can just simply do ord(u'人') instead of int(hex(ord(u'人')),16)
