
I am trying to pack a char to bytes in Python using the struct module, but my code won't return 4 bytes when packing the char:

import struct

def charToHex(s):
    #check if string is unicode
    if isinstance(s, str):
        print(struct.pack('<c', 'a'.encode(encoding='utf-8')))
        return '{:02x}'.format(struct.unpack('<I', struct.pack('<c', s.encode('utf-8')))[0])

    #check if input is already a byte
    elif isinstance(s, bytes):
        return '{:02x}'.format(struct.unpack('<I', struct.pack('<c', s))[0])

    else:
        raise Exception()

Can anyone explain to me why this won't work? I am just trying to convert the Unicode char to 4 bytes and unpack it, but it won't even pack correctly.

  • The c format is char in the C sense of a single byte, not the Python sense of a Unicode code point. Since the UTF-8 encoding of a Unicode character is anywhere from 1 to 4 bytes, you can't pack it as a c. You'd have to do something silly like pad it out to 4 bytes and pack that as 4c (at which point it's a lot simpler to use UTF-32 instead of UTF-8). Commented Jun 19, 2018 at 1:26

1 Answer

The c format is char in the C sense of a single byte, not the Python sense of a Unicode code point.

Meanwhile, the whole point of UTF-8 is that it's variable width. A character may encode to anything from 1 to 4 bytes. So you can't pack that into a c. You could pad it out to 4 bytes and then pack it into a 4c or an I or something, but that's a pretty silly thing to do.
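As a small illustration of the variable width, the sketch below encodes a few sample characters (chosen arbitrarily here, not taken from the question) and shows that struct rejects anything longer than one byte for the c format. The names utf8_lengths and multi_byte_rejected are just illustrative.

```python
import struct

# Sample characters picked for illustration: 1, 2, 3 and 4 bytes in UTF-8.
samples = ["a", "ł", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)

# The 'c' format accepts a bytes object of length 1, so ASCII works...
packed = struct.pack("<c", "a".encode("utf-8"))

# ...but a multi-byte UTF-8 sequence raises struct.error.
try:
    struct.pack("<c", "ł".encode("utf-8"))
    multi_byte_rejected = False
except struct.error as exc:
    multi_byte_rejected = True
    print("struct.error:", exc)
```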

If you want to use exactly 4 bytes for each character, much simpler to just use UTF-32. Or, since the UTF-32 encoding of a single character is just the Unicode code point as a 4-byte int, and that's exactly the same thing that ord returns, you can just skip the encode step.
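A quick way to check that equivalence, assuming little-endian byte order as in the question's '<' format strings: packing ord(ch) as a 4-byte unsigned int should give the same bytes as encoding with 'utf-32-le' (the plain 'utf-32' codec would prepend a byte order mark). The helper name utf32_bytes is hypothetical.

```python
import struct

def utf32_bytes(ch):
    """Pack a single character's code point as 4 little-endian bytes."""
    return struct.pack("<I", ord(ch))

# Compare against the codec for a few arbitrary sample characters.
for ch in ["a", "ł", "😀"]:
    assert utf32_bytes(ch) == ch.encode("utf-32-le")

print(utf32_bytes("a"))  # b'a\x00\x00\x00'
```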

For a single-byte bytes object, it does make sense to pack as a c, but then it makes no sense to unpack that as an I: c produces 1 byte, while I expects 4.
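The size mismatch can be seen directly with struct.calcsize. This is only a sketch of why the round trip in the question fails; the variable names are illustrative.

```python
import struct

c_size = struct.calcsize("<c")  # 1 byte
i_size = struct.calcsize("<I")  # 4 bytes

one_byte = struct.pack("<c", b"a")

# Unpacking 1 byte with a 4-byte format raises struct.error.
try:
    struct.unpack("<I", one_byte)
    unpack_failed = False
except struct.error as exc:
    unpack_failed = True
    print("struct.error:", exc)

# A 1-byte integer format such as 'B' reads the same byte back as an int.
value = struct.unpack("<B", one_byte)[0]
print(c_size, i_size, value)  # 1 4 97
```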

In fact, it's not clear what you're even using struct for here. If all you're trying to do is pack a number and unpack the same number, just use the number as-is.

Meanwhile, 02x doesn't make much sense as a format for a 4-byte integer, because a 4-byte integer takes 8 hex digits, not 2.
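To make the width point concrete: the number in a format spec like 02x is a minimum width, so larger values are not truncated. A short sketch with arbitrary sample values:

```python
# 'a' is U+0061: two digits are enough, eight pads with zeros.
two_digits = format(0x61, "02x")      # '61'
eight_digits = format(0x61, "08x")    # '00000061'

# 'ł' is U+0142: 02x is only a minimum width, so this yields three digits.
overflowing = format(0x142, "02x")    # '142'

# A code point outside the Basic Multilingual Plane still fits in 08x.
emoji = format(0x1F600, "08x")        # '0001f600'

print(two_digits, eight_digits, overflowing, emoji)
```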

So, what you probably wanted was something like this:

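def charToHex(s):
    #check if string is unicode
    if isinstance(s, str):
        # the code point as a 4-byte value: 8 hex digits
        return format(ord(s), '08x')
    #check if input is already a byte
    elif isinstance(s, bytes):
        # a single byte: 2 hex digits
        return format(ord(s), '02x')
    else:
        raise TypeError('expected a str or bytes of length 1')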

4 Comments

Thank you so much! I used 02x because I want the function to return only two hex digits instead of the whole thing.
@Silver So what two hex digits do you want to return for a non-ASCII character like ł whose code point is 3 or more digits long in hex?
Aaah I get your point but the program won't take non-ASCII characters as an input and I have to stick to the requirements for this program.
@Silver In that case, you could do s = s.encode('ascii') if given a str, and then only deal with the bytes case. (If someone does pass you a non-ASCII character somehow, you’ll get an exception from the encode.) But you can just use 02x as you say. In that case, if you get non-ASCII characters somehow, you’ll end up returning 3 to 6 digits instead of 2 and probably screwing up your output instead of raising an exception. Which wouldn’t be great for a production web server, but for homework that didn’t even mention non-ASCII characters, yeah, I’ll bet it’s fine.
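The ASCII-only variant suggested in the last comment could be sketched roughly as follows. The function name char_to_hex_ascii and the choice of exception types are assumptions for illustration, not something from the thread.

```python
def char_to_hex_ascii(s):
    """Return two hex digits for a single ASCII character or single byte."""
    if isinstance(s, str):
        # Raises UnicodeEncodeError if s contains a non-ASCII character.
        s = s.encode("ascii")
    if not isinstance(s, bytes):
        raise TypeError("expected str or bytes")
    if len(s) != 1:
        raise ValueError("expected exactly one character or byte")
    # Indexing a bytes object yields an int in range(256).
    return format(s[0], "02x")

print(char_to_hex_ascii("a"))      # 61
print(char_to_hex_ascii(b"\xff"))  # ff
```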
