Python 2.7, convert utf8 string to ascii

Question

I am working with python 2.7.12 I have string which contains a unicode literal, which is not of type Unicode. I would like to convert this to text. This example explains what I am trying to do.

>>> s
'\x00u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00'
>>> print s
username
>>> type(s)
<type 'str'>
>>> s == "username"
False

How would I go about converting this string?

ShadowRanger · Accepted Answer · 2016-11-16 04:49:17Z

That's not UTF-8, it's UTF-16, though it's unclear whether it's big endian or little endian (you have no BOM, and you have a leading and trailing NUL byte, making it an uneven length). For text in the ASCII range, UTF-8 is indistinguishable from ASCII, while UTF-16 alternates NUL bytes with the ASCII encoded bytes (as in your example).

In any event, converting to plain ASCII is fairly easy, you just need to deal with the uneven length one way or another:

s = 'u\x00s\x00e\x00r\x00n\x00a\x00m\x00e\x00' # I removed \x00 from beginning manually
sascii = s.decode('utf-16-le').encode('ascii')

# Or without manually removing leading \x00
sascii = s.decode('utf-16-be', errors='ignore').encode('ascii')

Course, if your inputs are just NUL interspersed ASCII and you can't figure out the endianness or how to get an even number of bytes, you can just cheat:

sascii = s.replace('\x00', '')

But that won't raise exceptions in the case where the input is some completely different encoding, so it may hide errors that specifying what you expect would have caught.

Collectives™ on Stack Overflow

Python 2.7, convert utf8 string to ascii

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related