I have a list of hex strings that I would like to convert into a list of Unicode characters. Everything here is done with Python 3.5.

If I do print(bytes.fromhex('hex_number').decode('utf-8')), it works. But it does not work if, after the conversion, I store the characters in a list again:

a = ['0063']  # The hex equivalent of the character 'c'.
b = [bytes.fromhex(_).decode('utf-8') for _ in a]
print(b)

will print

['\x00c']

instead of

['c']

while the code

a = ['0063']
for _ in a:
    print(bytes.fromhex(_).decode('utf-8'))

prints, as expected:

c

Can someone explain how I can convert the list ['0063'] into the list ['c'], and why I get this (to me) strange behavior?

To see what the hex value 0063 corresponds to, look here.

  • Why would 0063, decoded as UTF-8, ever produce 'c'? And why would 030C map to a space (which encodes to 20 in UTF-8 hex)? Commented Oct 9, 2017 at 8:14
  • I can't figure out what codec you are thinking of here. U+030C maps to the COMBINING CARON codepoint in the Unicode standard, for example. Commented Oct 9, 2017 at 8:17
  • @MartijnPieters 0063 in hex corresponds to 'c' in utf-8 (it would be U+0063). This is easy to see if you just use the code above. 030C corresponds to the COMBINING CARON, as you said. As I said in the question, this is shown as a space in my shell (probably because my shell is not able to map it to anything). Honestly, I do not understand what is wrong with my question. I did not pay much attention to the COMBINING CARON because it was not really important for answering the question. But if you prefer, I can use a different character that my shell can display easily. Commented Oct 9, 2017 at 8:29
  • @MartijnPieters I think the question should now be clearer, based on your comments. Otherwise, just let me know. Commented Oct 9, 2017 at 8:41
  • Right, you appear to have confused Unicode codepoints with UTF-8. U+0063 LATIN SMALL LETTER C is 63 in UTF-8, while U+030C COMBINING CARON is CC8C. Unicode codepoints != UTF-8. Perhaps you are thinking of UTF-16 (big endian order) instead? Commented Oct 9, 2017 at 8:58

2 Answers

You don't have UTF-8 data, if 0063 is U+0063 LATIN SMALL LETTER C. At best you have UTF-16 data, big endian order:

>>> bytes.fromhex('0063').decode('utf-16-be')
'c'
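Applied to the list from the question, this could look like the following minimal sketch. It assumes every entry really is big-endian UTF-16 hex; the helper name hex_utf16be_to_chars is made up for illustration.

```python
def hex_utf16be_to_chars(hex_values):
    """Decode each hex string as big-endian UTF-16 (assumed encoding)."""
    return [bytes.fromhex(h).decode('utf-16-be') for h in hex_values]


a = ['0063', '0064']
b = hex_utf16be_to_chars(a)
print(b)  # ['c', 'd']
```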

You may want to check whether your full data starts with a Byte Order Mark. For big-endian UTF-16 that is 'FEFF' in hex, in which case you can drop the -be suffix, as the decoder will know which byte order to use. If your data starts with 'FFFE' instead, you have little-endian UTF-16 and you sliced your data at the wrong point; in that case you took along the '00' byte that belongs to the preceding codepoint.
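A short sketch of the BOM behaviour, using hand-built sample bytes (hypothetical data, not taken from the question): with a BOM present, the generic 'utf-16' codec picks the byte order itself, and decoding the question's bytes with the wrong byte order yields an unrelated codepoint.

```python
# Sample data: a BOM followed by the letter 'c' in each byte order.
with_be_bom = bytes.fromhex('FEFF0063')
with_le_bom = bytes.fromhex('FFFE6300')

# The generic codec consumes the BOM and uses it to choose the byte order.
decoded_be = with_be_bom.decode('utf-16')
decoded_le = with_le_bom.decode('utf-16')
print(decoded_be, decoded_le)  # c c

# Same two bytes as in the question, read with the wrong byte order.
wrong_order = bytes.fromhex('0063').decode('utf-16-le')
print(ascii(wrong_order))  # '\u6300'
```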

UTF-8 is a variable-width encoding. The first 128 codepoints in the Unicode standard (corresponding to the ASCII range) encode directly to single bytes, identical to the ASCII standard. Codepoints in the Latin-1 range and beyond (up to U+07FF(*), the next 1920 codepoints) map to two bytes, etc.
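To make the variable width concrete, here is a small sketch that encodes a few hand-picked codepoints on either side of each boundary and reports how many bytes UTF-8 needs. The sample list is only illustrative.

```python
# Codepoints chosen around the 1-, 2-, 3- and 4-byte boundaries.
samples = [0x0063, 0x007F, 0x0080, 0x030C, 0x07FF, 0x0800, 0x1F600]

lengths = {}
for cp in samples:
    encoded = chr(cp).encode('utf-8')
    lengths[cp] = len(encoded)
    print('U+{:04X} -> {} ({} byte(s))'.format(cp, encoded.hex().upper(), len(encoded)))

# Count the codepoints that need exactly two bytes (U+0080 through U+07FF).
two_byte_count = sum(
    1 for cp in range(0x80, 0x800) if len(chr(cp).encode('utf-8')) == 2
)
print(two_byte_count)  # 1920
```

Note that U+0063 comes out as the single byte 63 and U+030C as CC8C, so neither matches its codepoint written as four hex digits.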

If your input really was UTF-8, then you really have a \x00 NULL character before that 'c'. Printing a NULL results in no output on many terminals, but you can use cat -v to turn such non-printable characters into caret escape codes:

$ python3 -c "print('\x00c')"
c
$ python3 -c "print('\x00c')" | cat -v
^@c

^@ is the representation for a NULL in the caret notation used by cat.
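This also accounts for the difference between the two snippets in the question. In the sketch below, the decoded string has two characters in both cases; print() on a list shows the repr() of each element, which escapes the NUL, while print() on the string itself writes the raw NUL to the terminal.

```python
s = bytes.fromhex('0063').decode('utf-8')

print(len(s))    # 2: a NUL character followed by 'c'
print(repr(s))   # '\x00c'

# Printing a list uses repr() for each element, so the NUL becomes visible.
as_list_text = str([s])
print(as_list_text)  # ['\x00c']
```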


(*) At the time of writing (2017), U+07FF was not assigned in Unicode; the last assigned codepoint in the two-byte UTF-8 range was U+07FA NKO LAJANYALAN.

8 Comments

Ok... Maybe I am starting to understand this stuff. Unicode is a set of conventions on how to store characters in memory. utf-8 follows those conventions using only 8 bits. When I encode something which requires more than 8 bits, it will be encoded using 16 bits and so on (this is done automagically: the official encoding is still utf-8). This works when encoding. When I want to decode something, I must know a priori how many bits I am going to use. This means that if I have a non-ASCII char I cannot use utf-8 for sure. Is this right?
UTF-8 is one of a set of possible serialisations of Unicode text. Unicode is much more than just a bunch of codepoints; those conventions go beyond mere serialisation. UTF-8 can represent everything in Unicode, using a variable number of bytes. UTF-16 and UTF-32 are other serialisations, and they use fixed-size units (2 and 4 bytes) per codepoint (where UTF-16 uses two 2-byte units, called a surrogate pair, for Unicode codepoints outside of the BMP).
8-bit is not a characterisation to apply here. You need to know, a priori, what serialisation standard (codec) was used. You can do some fingerprinting: if your data starts with 0000FEFF or FFFE0000, then you can assume, with high probability, that you have data using UTF-32 as the codec, for example.
@RiccardoPetraglia: Most codecs are not interchangeable. When you encode from str to bytes, you made a conscious choice to use the UTF-8 codec. You could have picked a different codec too. If you always settle for UTF-8, then you can always use the same codec too. If you don't, you need to record your selected codec somewhere. In XML documents, the first XML declaration is such a place. In HTML, a <meta> tag is often used.
@RiccardoPetraglia: in other words, if you are dealing with data from arbitrary sources, look for standard indicators for the codec, including the documented standard for the format.
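The BOM fingerprinting mentioned in these comments could be sketched roughly as follows. The function name guess_codec_from_bom and the sample bytes are made up for illustration; only the codecs.BOM_* constants come from the standard library, and a missing BOM tells you nothing about the codec.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
]


def guess_codec_from_bom(data, default=None):
    """Return a codec name if data starts with a known BOM, else default."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return default


sample = bytes.fromhex('0000FEFF00000063')  # UTF-32 BE BOM followed by 'c'
codec = guess_codec_from_bom(sample)
print(codec, sample.decode(codec))  # utf-32 c
```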
a = ['0063']  # The hex equivalent of the character 'c'.
b = [chr(int(x, 16)) for x in a]  # Treat each hex string as a Unicode codepoint number.
print(b)
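As a rough sketch of how this differs from decoding bytes: chr(int(x, 16)) treats each string as a codepoint number, so no codec is involved and codepoints outside the BMP work directly, whereas the same character expressed as UTF-16 hex needs a surrogate pair. The helper name codepoints_to_chars and the sample value 1F600 are chosen here for illustration.

```python
def codepoints_to_chars(hex_values):
    """Interpret each hex string as a Unicode codepoint number."""
    return [chr(int(h, 16)) for h in hex_values]


chars = codepoints_to_chars(['0063', '1F600'])
print(ascii(chars))  # ['c', '\U0001f600']

# The same non-BMP character as big-endian UTF-16 bytes is a surrogate pair.
utf16_hex = chars[1].encode('utf-16-be').hex().upper()
print(utf16_hex)  # D83DDE00
```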

2 Comments

@MartijnPieters Just to understand better: is this solution agnostic of the codec used? (Maybe I should ask a new question.)
It does what your question asks and works for any Unicode character. You may need a different kind of input, rather than a list of hex strings.
