I have this:

>>> su = u'"/\"'

In Python, how can I convert this to a representation that shows the Unicode code points? For the string above, that would be:

u'\u0022\u002F\u005C\u0022'

2 Answers

Your original string is not four characters but three because \" is an escape code for a double quote:

>>> su = u'"/\"'
>>> len(su)
3

Here's how to display it as escape codes:

>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u0022'

To get four characters, use a raw Unicode string, or double the backslash to escape it:

>>> su = ur'"/\"' # Raw version
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u005C\\u0022'

>>> su = u'"/\\"' # Escaped version
>>> ''.join(u'\\u{:04X}'.format(ord(c)) for c in su)
u'\\u0022\\u002F\\u005C\\u0022'

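Note the double backslashes in the result. The interpreter is showing the repr of the string, where each doubled backslash stands for a single literal backslash. With a single backslash, these would be escape codes, and the literal would be no different from your four-character string: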

>>> ur'"/\"' == u'\u0022\u002F\u005C\u0022'
True

Printing it shows the content of the strings:

>>> print u'\u0022\u002F\u005C\u0022'
"/\"
>>> print(''.join(u'\\u{:04X}'.format(ord(c)) for c in su))
\u0022\u002F\u005C\u0022
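
For readers on Python 3, where every str is already Unicode and the ur prefix is no longer valid syntax, the same join can be wrapped in a small helper. This is only a sketch: the name to_u_escapes is made up for illustration, and like the one-liner above it only gives sensible output for characters in the BMP (code points up to U+FFFF).

```python
def to_u_escapes(text):
    """Write every character of text as a \\uXXXX escape (BMP only)."""
    # ord() gives the code point; {:04X} pads it to four uppercase hex digits.
    return ''.join('\\u{:04X}'.format(ord(ch)) for ch in text)


su = '"/\\"'  # four characters: quote, slash, backslash, quote
print(to_u_escapes(su))  # \u0022\u002F\u005C\u0022
```

Because print() is used, the result shows single backslashes. Evaluating to_u_escapes(su) at the interactive prompt would show the repr, with each backslash doubled.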

5 Comments

Why do I need to use ur (Unicode raw)? It works well for me without the ur version of the string.
You don't, if you intended to escape the " in the first place. Still a bit of a confusing example, however.
@abc Is your intended string "/" or "/\"? Putting a \ character in an example string, especially before a character that can be escaped (such as "), without clarifying your intention results in this confusion.
@metatoaster Actually I was confused myself. Looks like my string is 3 chars long not 4. So my string is "/".
@abc: Note that the above works only for BMP characters, e.g., it may fail for an emoji such as u'😀' (U+1F600).

To support the full Unicode range, you could use the unicode-escape codec to get the text representation. To also represent characters in the ASCII range as Unicode escapes, and to force the \u00xx form even for characters such as u'\xff', you could use a regex:

#!/usr/bin/env python2
import re

su = u'"/"\U000af600'
assert u'\ud800' not in su # no lone surrogate
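# Prefix each character in U+0000..U+00FF with a U+D800 placeholder plus uXXXX,
# let unicode-escape encode everything else, then turn the placeholder into a backslash.
print re.sub(ur'[\x00-\xff]', lambda m: u"\ud800u%04x" % ord(m.group()), su,
             flags=re.U).encode('unicode-escape').replace('\\ud800', '\\')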

A lone surrogate (U+D800) is used as a placeholder to avoid escaping the backslash twice.

Output

\u0022\u002f\u0022\U000af600
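
On Python 3 the same result can be had without the regex or the surrogate placeholder, because iterating over a str yields whole code points. The following is a sketch under that assumption, not a drop-in replacement for the Python 2 script above; the function name to_unicode_escapes is hypothetical. It uses \uXXXX up to U+FFFF and \UXXXXXXXX beyond that, which matches the two forms the unicode-escape codec produces.

```python
def to_unicode_escapes(text):
    """Write every character as \\uXXXX, or \\UXXXXXXXX outside the BMP."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # BMP character: four hex digits are enough.
            parts.append('\\u{:04x}'.format(cp))
        else:
            # Supplementary character: needs the eight-digit form.
            parts.append('\\U{:08x}'.format(cp))
    return ''.join(parts)


print(to_unicode_escapes('"/"\U0001f600'))  # \u0022\u002f\u0022\U0001f600
```

The example string uses U+1F600, the emoji mentioned in the comments, rather than the code point in the script above.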
