I am working with Python 2.7.12. I have a string that holds Unicode-encoded text, but it is of type str, not unicode. I would like to convert it to plain text. This example shows what I am trying to do:

>>> s
'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
>>> print s
username
>>> type(s)
<type 'str'>
>>> s == "username"
False

How would I go about converting this string?

1 Answer

That's not UTF-8, it's UTF-16, though it's unclear whether it's big endian or little endian: you have no BOM, and you have both a leading and a trailing NUL byte, which gives it an odd length (17 bytes, where UTF-16 always uses an even number). For text in the ASCII range, UTF-8 is indistinguishable from ASCII, while UTF-16 alternates NUL bytes with the ASCII-encoded bytes (as in your example).
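To make the difference concrete, here is a small sketch that encodes the same text three ways and compares the results with the bytes from the question. It uses bytes literals (b'...') so it should behave the same on Python 2.7 and Python 3; the variable names are only illustrative.

```python
text = u'username'

utf8 = text.encode('utf-8')          # identical to ASCII for this text
utf16_le = text.encode('utf-16-le')  # each ASCII byte followed by a NUL
utf16_be = text.encode('utf-16-be')  # each ASCII byte preceded by a NUL

print(repr(utf8))
print(repr(utf16_le))
print(repr(utf16_be))

# The 17 bytes from the question fit either reading:
# big endian plus one stray trailing NUL, or
# little endian plus one stray leading NUL.
question_bytes = b'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
fits_big_endian = (utf16_be + b'\x00' == question_bytes)
fits_little_endian = (b'\x00' + utf16_le == question_bytes)
print(fits_big_endian, fits_little_endian)
```

Both comparisons come out true, which is why the endianness cannot be determined from this sample alone.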

In any event, converting to plain ASCII is fairly easy; you just need to deal with the odd length one way or another:

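# Option 1: drop the leading \x00 by hand, then decode as little endian
s = 'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
sascii = s.decode('utf-16-le').encode('ascii')

# Option 2: keep the original string (leading \x00 included), decode as big endian,
# and let errors='ignore' drop the unpaired trailing \x00
s = '\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
sascii = s.decode('utf-16-be', errors='ignore').encode('ascii')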
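If you would rather not pick an option by hand, the two approaches can be folded into one function. This is only a sketch: utf16_to_ascii is a hypothetical helper name, and guessing the byte order from the first byte is an assumption that holds only for ASCII-range text like your sample.

```python
def utf16_to_ascii(data):
    """Hypothetical helper: decode BOM-less UTF-16 that holds ASCII-range text."""
    # Assumption: for ASCII-range text, a leading NUL suggests big endian,
    # anything else suggests little endian.
    if data[:1] == b'\x00':
        codec = 'utf-16-be'
    else:
        codec = 'utf-16-le'
    # Trim one trailing byte if the length is odd, so the decoder sees
    # whole 2-byte code units and can stay in strict mode.
    even = data[:len(data) - len(data) % 2]
    return even.decode(codec).encode('ascii')


from_question = utf16_to_ascii(b'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00')
from_trimmed = utf16_to_ascii(b'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00')
print(from_question == from_trimmed)
```

Unlike errors='ignore', this keeps strict decoding, so malformed input inside the data still raises an exception.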

Of course, if your inputs are just NUL-interspersed ASCII and you can't figure out the endianness or how to get an even number of bytes, you can just cheat:

sascii = s.replace('\x00', '')

But that won't raise exceptions in the case where the input is some completely different encoding, so it may hide errors that specifying what you expect would have caught.
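A short sketch of that trade-off, using bytes literals so it runs on Python 2.7 and 3. The function names and the sample input (UTF-8 bytes for "café") are made up for illustration.

```python
def strip_nuls(data):
    # The "cheat": remove every NUL byte, whatever the input really is.
    return data.replace(b'\x00', b'')


def decode_strict(data):
    # State what you expect: UTF-16 little endian holding ASCII-only text.
    return data.decode('utf-16-le').encode('ascii')


good = b'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
bad = b'caf\xc3\xa9'  # UTF-8 bytes, not UTF-16 at all

# On well-formed input both approaches agree.
both_agree = (strip_nuls(good) == decode_strict(good))

# On unexpected input the cheat returns the bytes unchanged, without complaint...
silently_passed = strip_nuls(bad)

# ...while the strict version raises, because 5 bytes cannot be whole UTF-16 units.
try:
    decode_strict(bad)
    raised = False
except UnicodeError:
    raised = True

print(both_agree, silently_passed == bad, raised)
```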
