I have a UTF-8 character whose bytes are written as hex with '_' in between, e.g., '_ea_b4_80'. I'm trying to convert it into the actual UTF-8 character using the replace method, but I can't get the correct encoding.

This is a code example:

import sys
reload(sys)  
sys.setdefaultencoding('utf8')

r = '_ea_b4_80'
r2 = '\xea\xb4\x80'

r = r.replace('_', '\\x')
print r
print r.encode("utf-8")
print r2

In this example, r is not the same as r2; this is the output:

\xea\xb4\x80
\xea\xb4\x80
관  <-- correctly shown 

What might be wrong?

1 Answer

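\x is only meaningful in string literals; you can't use replace to add it. The escape is resolved when Python parses the source code, so building the same characters at runtime just gives you a literal backslash followed by 'x'.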
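To make the difference concrete, here is a small sketch (assuming Python 3; the variable names are only illustrative) comparing the string built with replace against a bytes literal whose escapes were resolved by the parser:

```python
# Built at runtime: each '_' becomes two ordinary characters, '\' and 'x'.
replaced = '_ea_b4_80'.replace('_', '\\x')

# Written as a literal: the parser turns each \xNN escape into one byte.
literal = b'\xea\xb4\x80'

print(len(replaced))  # 12 characters: \ x e a \ x b 4 \ x 8 0
print(len(literal))   # 3 bytes
print(replaced == '\\xea\\xb4\\x80')  # True: still plain text, not bytes
```

The two values only look alike when printed as source; one is twelve characters of text and the other is three bytes.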

To get your desired result, convert to bytes, then decode:

import binascii

r = '_ea_b4_80'

rhexonly = r.replace('_', '')          # Returns 'eab480'
rbytes = binascii.unhexlify(rhexonly)  # Returns b'\xea\xb4\x80'
rtext = rbytes.decode('utf-8')         # Returns '관' (unicode if Py2, str Py3)
print(rtext)

which should give you the result you want.

If you're using modern Py3, you can avoid the import by using the bytes.fromhex class method in place of binascii.unhexlify. This assumes r is in fact a str: bytes.fromhex, unlike binascii.unhexlify, only takes str inputs, not bytes inputs:

rbytes = bytes.fromhex(rhexonly)  # Returns b'\xea\xb4\x80'
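If this conversion is needed in more than one place, the steps can be wrapped in a small function. This is only a sketch assuming Python 3 and a str input; decode_underscore_hex and its sep parameter are hypothetical names, not part of any library:

```python
def decode_underscore_hex(encoded, sep='_'):
    """Hypothetical helper: turn text like '_ea_b4_80' into the string it encodes."""
    hex_only = encoded.replace(sep, '')   # '_ea_b4_80' -> 'eab480'
    raw = bytes.fromhex(hex_only)         # 'eab480' -> b'\xea\xb4\x80'
    return raw.decode('utf-8')            # bytes -> str


print(decode_underscore_hex('_ea_b4_80'))
print(decode_underscore_hex('_ed_95_9c_ea_b8_80'))  # two characters in one input
```

Because the separators are stripped before decoding, the same function handles inputs containing several encoded characters. Malformed hex raises ValueError from bytes.fromhex, and byte sequences that are not valid UTF-8 raise UnicodeDecodeError.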