Converting a unicode string to see unicode code points

Question

I have this:

>>> su = u'"/\"'

In Python, how can I convert this to a representation that shows the Unicode code points? For the string above, that would be:

u'\u0022\u002F\u005C\u0022'

Mark Tolonen · Accepted Answer · 2015-10-22 02:13:31Z

5

Your original string is not four characters but three because \" is an escape code for a double quote:

>>> su = u'"/\"'
>>> len(su)
3
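
To check which characters a literal really contains, it can help to list each code point next to its Unicode name. A minimal sketch, assuming Python 3 (where the u prefix is optional and print is a function); describe_chars is a hypothetical helper name, not something from the question:

```python
import unicodedata

def describe_chars(s):
    """Return a (code point, Unicode name) pair for each character in s."""
    # unicodedata.name() raises ValueError for unnamed characters (for example
    # control characters) unless a default is passed as the second argument.
    return [('U+{:04X}'.format(ord(c)), unicodedata.name(c, '<unnamed>'))
            for c in s]

for code, name in describe_chars(u'"/\"'):    # \" is an escape for a plain "
    print(code, name)

for code, name in describe_chars(u'"/\\"'):   # \\ is one literal backslash
    print(code, name)
```

The first loop lists three characters and the second lists four, with REVERSE SOLIDUS (the backslash) as the third.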

Here's how to display it as escape codes:

>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u0022'

Use a Unicode raw string, or double the backslash to escape it, and you get four characters:

>>> su = ur'"/\"' # Raw version
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u005C\\u0022'

>>> su = u'"/\\"' # Escaped version
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u005C\\u0022'

Note the double backslashes in the result. Each pair represents a single literal backslash. With one backslash, they would be escape codes, no different from your original string:

>>> ur'"/\"' == u'\u0022\u002F\u005C\u0022'
True

Printing it shows the content of the strings:

>>> print u'\u0022\u002F\u005C\u0022'
"/\"
>>> print(''.join(u'\\u{:04X}'.format(ord(c)) for c in su))
\u0022\u002F\u005C\u0022
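
The relationship between the escape text and the string it spells can also be checked by decoding the text again. A rough sketch, assuming Python 3 and BMP-only input; to_escapes is a hypothetical wrapper around the join expression used above:

```python
def to_escapes(s):
    """Spell each character of s as a four-digit escape (BMP characters only)."""
    return ''.join('\\u{:04X}'.format(ord(c)) for c in s)

su = '"/\\"'                 # four characters: quote, slash, backslash, quote
escaped = to_escapes(su)     # plain ASCII text made of backslash-u sequences
print(escaped)

# Interpreting the escape text again should give back the original string.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == su)
```

Here escaped is 24 characters of ordinary text, while restored is the four-character string those escapes describe.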

edited Oct 22, 2015 at 2:13

answered Oct 22, 2015 at 1:55


5 Comments

Ankur Agarwal Over a year ago

Why do I need to use ur (Unicode raw)? It works well for me without the ur version of the string.

metatoaster Over a year ago

You don't, if you intended to escape the " in the first place. Still a bit of a confusing example, however.

metatoaster Over a year ago

@abc is your intended string "/" or "/\"? Putting a \ character in an example string, especially before a character that can be escaped (such as "), without clarifying the intention results in this confusion.

Ankur Agarwal Over a year ago

@metatoaster Actually I was confused myself. Looks like my string is 3 chars long not 4. So my string is "/".

jfs Over a year ago

@abc: note: the above works only for BMP characters, e.g., it may fail for an emoji such as u'😀' (U+1F600)

jfs · Accepted Answer · 2015-10-24 13:01:51Z

1

To support the full Unicode range, you could use the unicode-escape codec to get the text representation. To also represent characters in the ASCII range as Unicode escapes, and to force the \u00xx form even for characters such as u'\xff' (which the codec would otherwise write as \xff), you could use a regex:

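#!/usr/bin/env python2
import re

su = u'"/"\U000af600'
assert u'\ud800' not in su # no lone surrogate
# Mark each character in U+0000..U+00FF with the placeholder U+D800 followed by
# uXXXX, let unicode-escape encode everything else, then turn the escaped
# placeholder into a single backslash.
print re.sub(ur'[\x00-\xff]', lambda m: u"\ud800u%04x" % ord(m.group()), su,
             flags=re.U).encode('unicode-escape').replace('\\ud800', '\\')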

A lone surrogate (U+D800) is used as a placeholder so that the backslash is not escaped twice.

Output

\u0022\u002f\u0022\U000af600
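
For comparison, the same output can be produced without the codec and regex by formatting each code point directly. This is a sketch under the assumption of Python 3, where iterating over a str yields whole code points; it is not the method of this answer, and escape_all is a hypothetical name:

```python
def escape_all(s):
    """Spell every code point of s as an escape sequence."""
    parts = []
    for c in s:
        cp = ord(c)
        if cp <= 0xFFFF:
            parts.append('\\u{:04x}'.format(cp))   # BMP: four hex digits
        else:
            parts.append('\\U{:08x}'.format(cp))   # beyond the BMP: eight hex digits
    return ''.join(parts)

print(escape_all('"/"\U000af600'))
```

Under that assumption it prints the same line as shown under Output above. On Python 2 narrow builds a character beyond the BMP is stored as two surrogates, so this loop would emit two \u escapes there instead.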

edited Oct 24, 2015 at 13:01

answered Oct 23, 2015 at 16:06

