Python Unicode hex string decoding

Question

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255 and I want to decode it into Unicode code points (u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9').

>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

However, if I try to decode the string: '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' I don't get the exception:

>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

How do I decode the Unicode hex string (the one that gets the exception) or convert it to a regular string that can be decoded?

Thanks for the help.

dda · Accepted Answer · 2014-11-01 13:43:10Z

4

That's because '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' is a byte string, not a Unicode string: the bytes represent valid windows-1255 characters rather than the Unicode code points of the Hebrew text.

Therefore, when you prefix it with a u, the Python 2 interpreter cannot decode the string (it first tries to encode it implicitly with the ascii codec, which is the error you see), or even print it on an ascii console:

>>> print u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
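
A minimal sketch of where that error comes from, assuming Python 2 semantics: calling .decode() on a unicode object first encodes it to bytes with the default ascii codec, and it is that hidden step that fails. The snippet performs the step explicitly, so it also runs on Python 3 (where text strings have no .decode() method at all). The variable names are illustrative only.

```python
text = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'

error_name = None
failing_range = None
try:
    # The step Python 2 performs implicitly before unicode.decode() can run.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    # exc.end is exclusive; the traceback message shows end - 1 ("position 0-3").
    failing_range = (exc.start, exc.end)

print(error_name)     # UnicodeEncodeError
print(failing_range)  # (0, 4)
```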

So, in order to convert your byte array to UTF-8, you will have to decode it as windows-1255 and then encode it to utf-8:

>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255').encode('utf8')
'\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'

Which gives the original Hebrew text:

>>> print '\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
החלק השלישי
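
The same conversion can be sketched with each step named. The b prefix makes the byte string explicit so the snippet behaves the same on Python 2.7 and Python 3; the variable names are assumptions made for illustration.

```python
raw = b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'  # bytes in windows-1255

text = raw.decode('windows-1255')   # bytes -> Unicode text
utf8_bytes = text.encode('utf-8')   # Unicode text -> UTF-8 bytes

# Going back the other way recovers the original windows-1255 bytes.
round_trip = utf8_bytes.decode('utf-8').encode('windows-1255')

print(text == u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9')  # True
print(round_trip == raw)                                                         # True
```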

edited Nov 1, 2014 at 13:43

dda

answered Nov 1, 2014 at 9:11

Adam Matan

2 Comments

stalk Over a year ago

The OP has a u'\xe4\xe7...' string, not a '\xe4\xe7...' one. I suppose he didn't add the u himself.

Adam Matan Over a year ago

@stalk The string appears in the question in two forms, with and without a u prefix. I think that's the point of the question.

glglgl · Accepted Answer · 2014-11-01 09:54:52Z

3

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255

That is self-contradictory. The u indicates it is a Unicode string. But if you say it is encoded in whatever, it must be a byte string (because a Unicode string can only be encoded into a byte string).

And indeed, your given entities (\xe4, \xe7, etc.) represent one byte each, and they only get their respective meaning through the given encoding, windows-1255.

In other words, if you have a u'\xe4', you can be sure it is the same as u'\u00e4' and NOT u'\u05d4', which is what the byte '\xe4' would mean when decoded as windows-1255.
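
A short check of this point, as a sketch that should run on Python 2.7 and Python 3. It only uses the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

as_unicode_literal = u'\xe4'                    # code point U+00E4
from_cp1255 = b'\xe4'.decode('windows-1255')    # byte 0xE4 read as windows-1255

print(as_unicode_literal == u'\u00e4')          # True
print(unicodedata.name(as_unicode_literal))     # LATIN SMALL LETTER A WITH DIAERESIS
print(from_cp1255 == u'\u05d4')                 # True
print(unicodedata.name(from_cp1255))            # HEBREW LETTER HE
```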

If, by any chance, you got your erroneous Unicode string from a source that is unaware of this problem, you can derive from it the byte string you really need with the help of a "1:1 coding", which is latin1.

So:

u_str = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'  # the erroneous Unicode string
correct_str = u_str.encode("latin1")
# now every byte of correct_str corresponds to the respective code point of u_str (here in the 0x80..0xFF range)
correct_u_str = correct_str.decode("windows-1255")
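
The two steps above could be wrapped in a small helper. This is only a sketch: reinterpret and its real_encoding parameter are hypothetical names, not part of any library. It assumes every character of the input is below U+0100; otherwise the latin-1 step raises UnicodeEncodeError.

```python
def reinterpret(text, real_encoding='windows-1255'):
    """Recover text whose bytes were wrongly taken as Unicode code points.

    Hypothetical helper: assumes each character of `text` is really one
    byte of data encoded in `real_encoding`.
    """
    raw = text.encode('latin-1')       # code points 0..255 -> identical byte values
    return raw.decode(real_encoding)   # interpret those bytes correctly


broken = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
fixed = reinterpret(broken)
print(fixed == u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9')  # True
```

Keeping the target encoding as a parameter is a design choice for the sketch; the thread itself only deals with windows-1255.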

edited Nov 1, 2014 at 9:54

answered Nov 1, 2014 at 9:49

glglgl

3 Comments

stalk Over a year ago

What is the meaning of "1:1 coding"? Why does u_str.encode("ascii") fail whereas u_str.encode("latin1") does not?

glglgl Over a year ago

@stalk By 1:1 encoding I mean an encoding which maps the first 256 Unicode code points to the 256 possible byte values. That is, as said, latin1. The ascii encoding only covers the ASCII range, which is 0..127.

Adam Matan Over a year ago

...And it is worth mentioning that for every ASCII character, the Unicode equivalent has the same code. Therefore any ASCII text is a valid Unicode text as well.
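
The "1:1" property discussed in these comments can be checked directly. This is a sketch rather than a proof; the names are illustrative, and bytearray is used only so the same lines run on Python 2.7 and Python 3.

```python
all_bytes = bytes(bytearray(range(256)))   # every possible byte value, 0x00..0xFF
as_text = all_bytes.decode('latin-1')

# latin-1 maps byte value n to code point n, and back again.
one_to_one = [ord(ch) for ch in as_text] == list(range(256))
reversible = as_text.encode('latin-1') == all_bytes

# ascii only covers code points 0..127.
low_half_ok = as_text[:128].encode('ascii') == all_bytes[:128]
try:
    as_text[128].encode('ascii')
    high_half_ok = True
except UnicodeEncodeError:
    high_half_ok = False

print(one_to_one, reversible, low_half_ok, high_half_ok)
```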

sajadkk · Accepted Answer · 2014-11-01 09:12:48Z

1

Try this:

>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.encode('latin-1').decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

answered Nov 1, 2014 at 9:12

sajadkk

Vishnu Upadhyay · Accepted Answer · 2014-11-01 09:10:27Z

-1

Decode like this:

>>> b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

answered Nov 1, 2014 at 9:10

Vishnu Upadhyay
