I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255 and I want to decode it into Unicode code points (u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9').

>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

However, if I try to decode the string: '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' I don't get the exception:

>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

How do I decode the Unicode hex string (the one that gets the exception) or convert it to a regular string that can be decoded?

Thanks for the help.

4 Answers

4

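That's because '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' is a byte string, not a Unicode string: the bytes represent windows-1255 characters, not the Unicode code points you intended.

Therefore, when you prefix it with u, Python 2 treats each \xNN as a code point in the U+0080..U+00FF range. Calling .decode() on a Unicode string first implicitly encodes it with the default ASCII codec, which fails, and printing fails the same way when the output encoding is ASCII: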

>>> print u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
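
As a rough sketch of where the error comes from (written so it runs on both Python 2 and Python 3; the variable names are only illustrative), the implicit step can be reproduced by encoding the Unicode string with the ASCII codec explicitly:

```python
# Sketch: reproduce the implicit ASCII encode that Python 2 performs
# when .decode() is called on a unicode object.
mislabeled = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'

try:
    # Python 2 effectively does this before decoding as windows-1255.
    mislabeled.encode('ascii')
    implicit_encode_failed = False
    failing_range = None
except UnicodeEncodeError as exc:
    implicit_encode_failed = True
    # exc.start is inclusive and exc.end is exclusive.
    failing_range = (exc.start, exc.end)

print(implicit_encode_failed)
print(failing_range)
```

The reported range covers the first four characters, which matches the "position 0-3" in the traceback: the fifth character is a plain space, which ASCII can encode.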

So, in order to convert your byte array to UTF-8, you will have to decode it as windows-1255 and then encode it to utf-8:

>>> '\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255').encode('utf8')
'\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'

Which gives the original Hebrew text:

>>> print '\xd7\x94\xd7\x97\xd7\x9c\xd7\xa7 \xd7\x94\xd7\xa9\xd7\x9c\xd7\x99\xd7\xa9\xd7\x99'
החלק השלישי
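
The same two steps can be wrapped in a small helper. This is only a sketch: the function name cp1255_to_utf8 is made up for illustration, and the b prefix is used so the snippet behaves the same on Python 2.6+ and Python 3.

```python
def cp1255_to_utf8(raw):
    """Decode windows-1255 bytes to text, then encode that text as UTF-8."""
    text = raw.decode('windows-1255')   # bytes -> Unicode string
    return text.encode('utf-8')         # Unicode string -> UTF-8 bytes


raw = b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
utf8_bytes = cp1255_to_utf8(raw)
print(repr(utf8_bytes))
```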

2 Comments

The OP has a u'\xe4\xe7...' string, not a '\xe4\xe7...' one; I suppose he didn't add the u himself.
@stalk The string appears in the question in two forms, with and without a u prefix. I think that's the point of the question.
3

I have the following string: u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9' encoded in windows-1255

That is self-contradictory. The u indicates it is a Unicode string. But if you say it is encoded in some encoding, it must be a byte string (because a Unicode string can only be encoded into a byte string).

And indeed, your given entities (\xe4, \xe7, etc.) each represent one byte, and they only get their intended meaning through the given encoding, windows-1255.

In other words, if you have a u'\xe4', you can be sure it is the same as u'\u00e4' and NOT u'\u05d4', which is what the byte '\xe4' decodes to in windows-1255.
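
A quick way to check this, sketched here with the standard unicodedata module (the variable names are illustrative), is to compare the character names of the two interpretations:

```python
import unicodedata

# In a Unicode literal, \xe4 is just a shorter spelling of \u00e4.
same_code_point = (u'\xe4' == u'\u00e4')

# The code point U+00E4 versus the byte 0xE4 interpreted as windows-1255.
latin_name = unicodedata.name(u'\xe4')
hebrew_name = unicodedata.name(b'\xe4'.decode('windows-1255'))

print(same_code_point)
print(latin_name)
print(hebrew_name)
```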

If you got your erroneous Unicode string from a source that is unaware of this problem, you can derive from it the byte string you really need with the help of a "1:1 coding", which is latin1.

So

correct_str = u_str.encode("latin1")
# now every byte of correct_str has the same value as the code point (0x00..0xFF) of the corresponding character in u_str
correct_u_str = correct_str.decode("windows-1255")
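
Put together as a reusable function, this could look like the sketch below. The name repair_mislabeled and its real_encoding parameter are assumptions made for illustration; it only works when every character of the input is in the U+0000..U+00FF range, otherwise the latin-1 encode step raises UnicodeEncodeError.

```python
def repair_mislabeled(text, real_encoding='windows-1255'):
    """Recover text whose raw bytes were wrongly stored as code points."""
    # latin-1 maps code points 0..255 to the identical byte values,
    # so this recovers the original bytes unchanged.
    raw = text.encode('latin-1')
    # Now decode those bytes with the encoding they were really in.
    return raw.decode(real_encoding)


u_str = u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'
fixed = repair_mislabeled(u_str)
print(repr(fixed))
```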

3 Comments

What is the meaning of "1:1 coding"? Why does u_str.encode("ascii") fail whereas u_str.encode("latin1") does not?
@stalk By 1:1 encoding I mean an encoding which maps the first 256 Unicode code points to the 256 possible byte values. That is, as said, latin1. The ascii encoding only covers the ASCII range, which is 0..127.
...And it is worth mentioning that for every ASCII character, the Unicode equivalent has the same code. Therefore any ASCII text is a valid Unicode text as well.
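
The difference between the two codecs can be checked directly. The following is a small sketch (variable names are illustrative) that decodes all 256 byte values as latin-1 and then tries to encode the result as ASCII:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(bytearray(range(256)))

# latin-1 maps byte N to code point N, so nothing is lost either way.
as_text = all_bytes.decode('latin-1')
code_points = [ord(ch) for ch in as_text]
round_trips = (as_text.encode('latin-1') == all_bytes)

# ASCII stops at 127, so encoding fails at the first character above it.
try:
    as_text.encode('ascii')
    first_bad = None
except UnicodeEncodeError as exc:
    first_bad = exc.start

print(round_trips)
print(first_bad)
```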
1

Try this

>>> u'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.encode('latin-1').decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'

-1

Decode like this:

>>> b'\xe4\xe7\xec\xf7 \xe4\xf9\xec\xe9\xf9\xe9'.decode('windows-1255')
u'\u05d4\u05d7\u05dc\u05e7 \u05d4\u05e9\u05dc\u05d9\u05e9\u05d9'
