1

I have a string like:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

I need to be able to get the corresponding byte literal of that unicode (for pickle.loads):

s_bytes: bytes = b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

Here the solution of using s_new: bytes = bytes(s_str, encoding="raw_unicode_escape") was posted, but it does not work for me. I got an incorrect result: b'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04' that has two backslashes (actually representing only one) for each one that it should have.

Also here and here a similar solution is proposed, but it does not work for me either: I end up getting the double backslashes again. Why does this occur? How do I get the bytes result I want?

4
  • Other than the solution in the answers this answer also works. Commented Sep 7, 2021 at 11:53
  • What does "byte literal of that unicode" mean? Unicode has just code points and no byte representation (it is abstract). It also defines a few encodings, so you should specify which encoding you mean. Note: your initial string is already problematic: what do you mean by \x? What do you mean by \xc0? '\x' should not be used in Unicode strings (only in encoded strings or binary data). For Unicode just use code points (\u and \U). I think your main problem is that you are mixing too many concepts (in a non-recommended way), so it is easy to get it wrong. Commented Sep 7, 2021 at 12:27
  • It is not possible to get s_not_bytes (the result of s_new) from s_str as you have shown. print(repr(s_str)) and post that. Commented Sep 7, 2021 at 16:19
  • The "raw-unicode-escape" encoding is what you want for the problem you described, and works for the input you show. Based on the answer that was given, and the symptoms described, the diagnosis is that s_str actually contains the backslashes. I edited the question to reflect that. I assume that's what you were trying to get at by talking about "raw Unicode"; but none of that part actually described it properly. Commented Aug 6, 2022 at 0:49

3 Answers

1

You do not have the escape codes shown below (a string of length 9), or you wouldn't get the double-backslash result:

s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

You have literal escape codes (length 36). Note the r for raw string, which prevents the escape codes from being interpreted:

s_str: str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

Note the difference. \\ is an escape code indicating a literal, single backslash:

>>> '\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\x00\x01\x00À\x01\x00\x00\x00\x04'
>>> r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
'\\x00\\x01\\x00\\xc0\\x01\\x00\\x00\\x00\\x04'
>>> len('\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
9
>>> len(r'\x00\x01\x00\xc0\x01\x00\x00\x00\x04')
36

The following gets the desired byte string in three steps. First, it encodes with the latin1 codec, which maps 1:1 between the first 256 code points (U+0000 to U+00FF) and the byte values 0x00 to 0xFF, so each character becomes one byte. Then it decodes the literal escape codes with unicode_escape, which produces a Unicode string again. Finally, it encodes with latin1 once more to convert 1:1 back to bytes:

s_bytes: bytes = s_str.encode('latin1').decode('unicode_escape').encode('latin1')
print(s_bytes)

Output:

b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'
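To make the three steps easier to follow, here is a minimal sketch that keeps each intermediate value in its own variable. The names step1, step2 and step3 are only illustrative, and the input is assumed to be the 36-character string with literal backslashes:

```python
# Assumed input: 36 characters, every backslash is a real character.
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# Step 1: str -> bytes, one byte per character (still 36 long).
step1 = s_str.encode("latin1")

# Step 2: interpret the escape sequences; the result is a str of 9 code points.
step2 = step1.decode("unicode_escape")

# Step 3: str -> bytes again, one byte per code point.
step3 = step2.encode("latin1")

print(len(step1), len(step2), len(step3))
print(step3)
```

The lengths show where the change happens: only the unicode_escape decoding shrinks the data from 36 units to 9.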

If you did have s_str as posted, a simple .encode('latin1') would convert it:

>>> s_str: str = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
>>> s_str.encode('latin1')
b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04'

4 Comments

Thanks, this solves the issue. I was reading this from a file using open(file,'r') and I guess that creates a raw string.
And is there a way of reading from a file containing (raw text) either b'\x00\x01\x00\xc0\x01\x00\x00\x00\x04' or \x00\x01\x00\xc0\x01\x00\x00\x00\x04 so that it will be considered directly a string of bytes of length 9?
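One possible way to handle both file layouts mentioned in this comment is sketched below. The helper name parse_escaped and the file name sample.txt are hypothetical, and the sketch assumes the file holds only ASCII text with escape-like sequences:

```python
import ast
import os
import tempfile


def parse_escaped(text: str) -> bytes:
    """Hypothetical helper: turn escape-like text into a bytes object."""
    text = text.strip()
    if text.startswith(("b'", 'b"')):
        # The text looks like a Python bytes literal; evaluate it safely.
        value = ast.literal_eval(text)
        if not isinstance(value, bytes):
            raise ValueError("expected a bytes literal")
        return value
    # Otherwise treat the text as bare escape sequences.
    return text.encode("latin1").decode("unicode_escape").encode("latin1")


# Write a small sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="ascii") as f:
        f.write(r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04" + "\n")
    with open(path, "r", encoding="ascii") as f:
        data = parse_escaped(f.read())

print(data, len(data))
```

ast.literal_eval is used for the b'...' form because it only evaluates literals and does not run arbitrary code.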
@TeresodelRíoAlmajano Reading a file doesn’t create a raw string. Raw strings are a way of creating string literals in code without interpreting escape codes. Your file had text with escape-code-like text. You can open(file,encoding='unicode_escape') if needed, but it would be better to post an actual sample of the file in case there is a better solution.
"and I guess that creates a raw string" It does not "create a raw string". There is no such thing as a "raw string" at runtime. However, reading a file into a string does mean that the string contains what the file actually contains: if there's a backslash followed by a lowercase n, then it's a backslash followed by a lowercase n, not a newline. Escape sequences only apply to string literals in your source code, unless you explicitly do something to interpret them. They apply before the code runs.
0

I was about to post the question when I found a working solution almost by chance. The combination that works for me is:

s_new: bytes = bytes(s_str.encode('utf-8').decode('unicode-escape'), encoding="oem")

As I said, I have no idea why this works, so feel free to explain it if you know why.
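A possible explanation, as a hedged sketch: the unicode-escape decoding is the step that interprets the escape sequences, and the final encoding only decides which byte each code point becomes. The "oem" codec exists only on Windows and follows the machine's OEM code page, so its result for U+00C0 can vary. The comparison below uses cp850 purely as an assumed example of an OEM code page:

```python
s_str = r"\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

# This part does the real work: 36 characters become 9 code points.
text = s_str.encode("utf-8").decode("unicode-escape")

# latin1 maps every code point below 256 to the byte with the same value.
as_latin1 = text.encode("latin1")

# cp850 is only an assumption standing in for "oem"; it stores
# U+00C0 under a different byte value than latin1 does.
as_cp850 = text.encode("cp850")

print(as_latin1)
print(as_cp850)
print(as_latin1 == as_cp850)
```

For input limited to ASCII-range values the two results would agree, which may be why the "oem" variant appeared to work; latin1 is the safer final step for values such as \xc0.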

0

You might simply use .encode("utf-8") to get the desired result, i.e.:

s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"
s_2 = s_1.encode("utf-8")
print(s_2)

output

b'\x00\x01\x00\xc3\x80\x01\x00\x00\x00\x04'
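Note that this output is 10 bytes long, not 9, because UTF-8 encodes U+00C0 as two bytes. A short comparison with latin1, using the same s_1 as above (the variable names utf8 and latin1 are only illustrative):

```python
s_1 = "\x00\x01\x00\xc0\x01\x00\x00\x00\x04"

utf8 = s_1.encode("utf-8")     # U+00C0 becomes the two bytes 0xC3 0x80
latin1 = s_1.encode("latin1")  # U+00C0 becomes the single byte 0xC0

print(len(utf8), utf8)
print(len(latin1), latin1)
```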

3 Comments

No, that does not solve the problem of the double backslash. At least for me.
@TeresodelRíoAlmajano what version of python are you using?
I am using Python 3.9. But the other answer explains why I was not getting the desired result.
