Pack char to bytes in python using struct

Question

I am trying to pack a char to bytes in Python using the struct module, but my code doesn't return 4 bytes when packing the char:

import struct

def charToHex(s):
    # check if string is unicode
    if isinstance(s, str):
        print(struct.pack('<c', 'a'.encode(encoding='utf-8')))
        return '{:02x}'.format(struct.unpack('<I', struct.pack('<c', s.encode('utf-8')))[0])

    # check if input is already a byte
    elif isinstance(s, bytes):
        return '{:02x}'.format(struct.unpack('<I', struct.pack('<c', s))[0])

    else:
        raise Exception()

Can anyone explain to me why this won't work? I am just trying to convert the Unicode char to 4 bytes and unpack it, but it won't even pack correctly.

The c format is char in the C sense of a single byte, not the Python sense of a Unicode code point. Since the UTF-8 encoding of a Unicode character is anywhere from 1 to 4 bytes, you can't pack it as a c. You'd have to do something silly like pad it out to 4 bytes and pack that as 4c (at which point it's a lot simpler to use UTF-32 instead of UTF-8).
– abarnert, Commented Jun 19, 2018 at 1:26

abarnert · Accepted Answer · 2018-06-19 01:33:01Z


The c format is char in the C sense of a single byte, not the Python sense of a Unicode code point.

Meanwhile, the whole point of UTF-8 is that it's variable width. A character may encode to anything from 1 to 4 bytes. So you can't pack that into a c. You could pad it out to 4 bytes and then pack it into a 4c or an I or something, but that's a pretty silly thing to do.
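
To make the width problem concrete, here is a minimal sketch (the sample characters 'a' and 'ł' are arbitrary choices for illustration). It shows that c accepts a one-byte encoding but rejects a two-byte one:

```python
import struct

# 'c' accepts exactly one byte, so an ASCII character packs fine.
ascii_packed = struct.pack('<c', 'a'.encode('utf-8'))

# 'ł' (U+0142) takes two bytes in UTF-8, so 'c' rejects it.
encoded = 'ł'.encode('utf-8')
try:
    struct.pack('<c', encoded)
    pack_failed = False
except struct.error:
    pack_failed = True

print(ascii_packed, len(encoded), pack_failed)  # b'a' 2 True
```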

If you want to use exactly 4 bytes for each character, much simpler to just use UTF-32. Or, since the UTF-32 encoding of a single character is just the Unicode code point as a 4-byte int, and that's exactly the same thing that ord returns, you can just skip the encode step.
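
As a quick check of that equivalence, the sketch below compares the two routes for one sample character. The little-endian byte order is an assumption made here to match the '<' used in the question; 'utf-32-le' is used rather than plain 'utf-32' so that no byte order mark is prepended.

```python
import struct

ch = 'ł'  # U+0142, an arbitrary sample character

# Explicit byte order, so the codec does not prepend a BOM.
utf32 = ch.encode('utf-32-le')

# The code point packed as a little-endian 4-byte unsigned int.
packed = struct.pack('<I', ord(ch))

print(len(utf32), utf32 == packed)  # 4 True
```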

For a single-char bytes, it does make sense to pack as a c, but then it makes no sense to unpack that as an I: the packed value is 1 byte long, while I expects 4.
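
A small sketch of that mismatch, assuming a single-byte input of b'a': unpacking the 1-byte result with a 4-byte format fails, whereas the one-byte format B (or plain indexing) gives the value back directly.

```python
import struct

one_byte = struct.pack('<c', b'a')  # exactly 1 byte long

# '<I' needs a 4-byte buffer, so unpacking 1 byte raises struct.error.
try:
    struct.unpack('<I', one_byte)
    unpack_failed = False
except struct.error:
    unpack_failed = True

# A matching one-byte format works, and so does indexing the bytes object.
value = struct.unpack('<B', one_byte)[0]

print(unpack_failed, value, one_byte[0])  # True 97 97
```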

In fact, it's not clear what you're even using struct for here. If all you're trying to do is pack a number and unpack the same number, just use the number as-is.

Meanwhile, 02x doesn't make much sense as a format for a 4-byte integer, because a 4-byte integer takes 8 hex digits, not 2.
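
To illustrate the difference between the two format specs, here is a short sketch with three sample characters chosen for this example. The width in 02x and 08x is only a minimum, so 02x stops producing two digits as soon as the code point exceeds 0xff:

```python
# Sample characters: ASCII, Latin Extended-A, and one outside the BMP.
samples = ['a', 'ł', '\U0001F600']

rows = []
for ch in samples:
    cp = ord(ch)
    # The width is a minimum: larger values are never truncated.
    rows.append((format(cp, '02x'), format(cp, '08x')))

for row in rows:
    print(row)
# ('61', '00000061')
# ('142', '00000142')
# ('1f600', '0001f600')
```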

So, what you probably wanted was something like this:

def charToHex(s):
    # str: ord gives the code point, shown as 8 hex digits (4 bytes)
    if isinstance(s, str):
        return format(ord(s), '08x')
    # bytes: a single byte is shown as 2 hex digits
    elif isinstance(s, bytes):
        return format(ord(s), '02x')
    else:
        raise TypeError('expected str or bytes')

4 Comments

silverbullet Over a year ago

Thank you so much! I used 02x because I want the output to be only two hex digits rather than the whole value.

abarnert Over a year ago

@Silver So what two hex digits do you want to return for a non-ASCII character like ł whose code point is 3 or more digits long in hex?

silverbullet Over a year ago

Aaah I get your point but the program won't take non-ASCII characters as an input and I have to stick to the requirements for this program.

abarnert Over a year ago

@Silver In that case, you could do s = s.encode('ascii') if given a str, and then only deal with the bytes case. (If someone does pass you a non-ASCII character somehow, you'll get an exception from the encode.) But you can also just use 02x as you say. In that case, if you get non-ASCII characters somehow, you'll end up returning 3, 4, or more digits instead of 2 and probably screwing up your output instead of raising an exception. Which wouldn't be great for a production web server, but for homework that didn't even mention non-ASCII characters, yeah, I'll bet it's fine.
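
The ASCII-only approach from this comment could look roughly like the sketch below. The name char_to_hex_ascii and the length check are assumptions added for illustration; they are not part of the original answer.

```python
def char_to_hex_ascii(s):
    """Hypothetical ASCII-only variant: always returns two hex digits."""
    if isinstance(s, str):
        # Raises UnicodeEncodeError if s contains a non-ASCII character.
        s = s.encode('ascii')
    # Assumed extra check: only a single byte is accepted.
    if not isinstance(s, bytes) or len(s) != 1:
        raise ValueError('expected a single ASCII character or a single byte')
    return format(s[0], '02x')


print(char_to_hex_ascii('a'))   # 61
print(char_to_hex_ascii(b'a'))  # 61
```

With this version, a character like 'ł' fails loudly at the encode step instead of silently producing three digits.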
