Convert unicode string into byte string [duplicate]

Question

I have a string like:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I get an incorrect result, b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04', which has two backslashes (actually representing only one) for each one that it should have.

Also here and here a similar solution is proposed, but it does not work for me either; I end up getting the double backslashes again. Why does this occur? How do I get the bytes result I want?

Besides the solutions in the answers, this answer also works.
– Tereso del Río Almajano, Commented Sep 7, 2021 at 11:53
What does "byte literal of that unicode" mean? Unicode has just code points, no byte representation (it is abstract). It also defines a few encodings, so you should specify which encoding you mean. Note: your initial string is already problematic: what do you mean by \x? What do you mean by \xc0? '\x' should not be used in Unicode strings (only in encoded strings or binary data). For Unicode, just use code points (\u and \U). I think your main problem is that you are mixing too many concepts (in a non-recommended way), so it is easy to get it wrong.
– Giacomo Catenazzi, Commented Sep 7, 2021 at 12:27
It is not possible to get s_not_bytes (the result of s_new) from s_str as you have shown. Run print(repr(s_str)) and post the output.
– Mark Tolonen, Commented Sep 7, 2021 at 16:19
The "raw-unicode-escape" encoding is what you want for the problem you described, and works for the input you show. Based on the answer that was given, and the symptoms described, the diagnosis is that s_str actually contains the backslashes. I edited the question to reflect that. I assume that's what you were trying to get at by talking about "raw Unicode"; but none of that part actually described it properly.
– Karl Knechtel, Commented Aug 6, 2022 at 0:49

Mark Tolonen · Accepted Answer · 2021-09-07 16:41:27Z

1

You do not have byte escape codes as shown below (length 9), or you wouldn't get the s_not_bytes result (the double-backslash output of s_new):

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You have literal escape codes (length 36). Note the r prefix, which marks a raw string literal and prevents the escape codes from being interpreted:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

Note the difference. \\ is an escape code indicating a literal, single backslash:

>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36

The following gets the desired byte string. It first converts each code point to a byte using the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF. It then decodes the literal escape codes with unicode_escape, which yields a Unicode string again, so it encodes with latin1 once more to convert 1:1 back to bytes:

s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)

Output:

b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
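The one-liner above can be unpacked into its three stages to see where the length drops from 36 to 9. This is only a sketch of the same chain; the intermediate names raw and text are introduced here for illustration and are not part of the original answer.

```python
# Text containing literal backslash-x sequences (36 characters), as in the question.
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# Stage 1: str -> bytes, one byte per character (every character here is ASCII).
raw = s_str.encode("latin1")

# Stage 2: interpret the escape sequences; the result is a 9-character str.
text = raw.decode("unicode_escape")

# Stage 3: str -> bytes again, mapping U+0000..U+00FF straight to 0x00..0xFF.
s_bytes = text.encode("latin1")

print(len(s_str), len(raw), len(text), len(s_bytes))  # 36 36 9 9
print(s_bytes)
```

If the escapes produce a code point above U+00FF (for example from a \u sequence), the final latin1 step raises UnicodeEncodeError, since such a character has no single-byte equivalent.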

If you did have s_str as posted, a simple .encode('latin1') would convert it:

>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

edited Sep 7, 2021 at 16:41

answered Sep 7, 2021 at 16:26

4 Comments

Tereso del Río Almajano Over a year ago

Thanks, this solves the issue. I was reading this from a file using open(file,'r') and I guess that creates a raw string.

Tereso del Río Almajano Over a year ago

And is there a way of reading from a file containing (raw text) either b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04' or \x00\x01\x00\xc0\x01\x00\x00\x00\x04 so that it will be considered directly a string of bytes of length 9?

Mark Tolonen Over a year ago

@TeresodelRíoAlmajano Reading a file doesn’t create a raw string. Raw strings are a way of creating string literals in code without interpreting escape codes. Your file had text with escape-code-like text. You can open(file,encoding='unicode_escape') if needed, but it would be better to post an actual sample of the file in case there is a better solution.

Karl Knechtel Over a year ago

"and I guess that creates a raw string" It does not "create a raw string". There is no such thing as a "raw string". However, reading a file into a string does mean that the string contains what the file actually contains: if there's a backslash followed by a lowercase n, then it's a backslash followed by a lowercase n, not a newline. Escape sequences only apply to string literals in your source code, unless you explicitly do something to interpret them. They apply before the code runs.
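For the follow-up question about reading either form from a file, one possible approach is sketched below. The helper name read_escaped_bytes and the file names are hypothetical, and the sketch assumes the file holds nothing but the escape-like text (surrounding whitespace is stripped). The b'...' form is parsed with ast.literal_eval; the bare form goes through the latin1 / unicode_escape chain from the answer.

```python
import ast
import os
import tempfile


def read_escaped_bytes(path):
    """Hypothetical helper: read escape-like text from `path` and return bytes."""
    with open(path, "r", encoding="latin1") as f:
        text = f.read().strip()
    if text.startswith(("b'", 'b"')):
        # The file holds the repr of a bytes object; parse it as a literal.
        value = ast.literal_eval(text)
        if not isinstance(value, bytes):
            raise ValueError("file does not contain a bytes literal")
        return value
    # Otherwise treat the text as bare \xNN escape sequences.
    return text.encode("latin1").decode("unicode_escape").encode("latin1")


# Demonstration with two temporary files, one per form mentioned in the comment.
samples = [
    ("bare.txt", r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"),
    ("repr.txt", r"b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'"),
]
results = []
with tempfile.TemporaryDirectory() as tmp:
    for name, content in samples:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="latin1") as f:
            f.write(content)
        results.append(read_escaped_bytes(path))

print(results)
```

Reading the file as plain text and converting afterwards keeps the escape handling explicit, which matches the point that the file itself contains backslashes and letters, not the bytes they describe.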

Tereso del Río Almajano · Accepted Answer · 2021-09-07 10:47:15Z

0

I was about to post the question when I encountered a valid solution almost by chance. The combination that works for me is:

s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")

As I said I have no idea why this works so feel free to explain it if you know why.
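A possible explanation, offered as a sketch rather than a definitive diagnosis: the decode('unicode-escape') step is what interprets the backslash sequences, and the final encoding only decides how the resulting code points become bytes. The "oem" codec exists only on Windows and follows the system's OEM code page, so whether a non-ASCII value such as \xc0 survives depends on the machine. The example below uses cp850 purely as a stand-in for an OEM code page (an assumption) and compares it with latin1.

```python
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# This step interprets the escape sequences and yields a 9-character str.
decoded = s_str.encode("utf-8").decode("unicode_escape")

# cp850 stands in for an OEM code page (assumption; the real one varies by system).
via_cp850 = decoded.encode("cp850")
# latin1 maps U+0000..U+00FF to the same byte values on every platform.
via_latin1 = decoded.encode("latin1")

print(via_cp850)
print(via_latin1)
print(via_cp850 == via_latin1)
```

The ASCII-range bytes come out the same either way; only the latin1 result is guaranteed to keep 0xc0 as 0xc0.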

answered Sep 7, 2021 at 10:47

Daweo · Accepted Answer · 2021-09-07 11:01:00Z

0

You might simply use .encode("utf-8") to get the desired result, i.e.:

s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)

Output:

b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
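Note that the output above contains \xc3\x80 where the question asks for \xc0: UTF-8 encodes U+00C0 as two bytes. A short comparison, assuming a string without literal backslashes as in this answer, shows the difference from latin1 (the names as_utf8 and as_latin1 are illustrative only):

```python
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"  # 9 code points, no literal backslashes

as_utf8 = s_1.encode("utf-8")     # U+00C0 becomes the two bytes 0xC3 0x80
as_latin1 = s_1.encode("latin1")  # U+00C0 becomes the single byte 0xC0

print(len(as_utf8), as_utf8)      # 10 bytes
print(len(as_latin1), as_latin1)  # 9 bytes
```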

answered Sep 7, 2021 at 11:01

3 Comments

Tereso del Río Almajano Over a year ago

No, that does not solve the problem of the double backslash. At least for me.

Daweo Over a year ago

@TeresodelRíoAlmajano What version of Python are you using?

Tereso del Río Almajano Over a year ago

I am using Python 3.9. But the other answer explains why I was not getting the desired result.
