
I have a YAML file with test cases for encoding and decoding elements. The left-hand side represents the expected encoded bytes, and the right-hand side contains the original element. For example, the VarInt test cases are:

examples:
  "\0": 0
  "\u0001": 1
  "\u000A": 10
  "\u00c8\u0001": 200
  "\u00e8\u0007": 1000
  "\u00a9\u0046": 9001
  "\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u0001": -1

The encodings (left-hand side) for the first three examples work correctly when read as strings (which are automatically interpreted as UTF-8 in Rust).

However, the fourth example (200) and the subsequent ones don't yield the correct results. Using the encoding for 200 ("\u00c8\u0001") as an example:

Reading the encoding as a UTF-8 string (incorrect):

let encoding_as_utf8_string = "\u{00c8}\u{0001}";
// "È\u{1}"
println!("Encoding as a UTF-8 string: {:?}", encoding_as_utf8_string);

let utf8_bytes: &[u8] = encoding_as_utf8_string.as_bytes();
// [195, 136, 1] (Incorrect)
println!("Bytes obtained from the encoding when read as a UTF-8 string: {:?}", utf8_bytes);

Reading the encoding as a byte array (correct):

// Caution: b"\xc8\x01" is not valid UTF-8, so calling from_utf8_unchecked on it
// is undefined behavior; it is shown only to illustrate the bytes I want.
let string_from_byte_array: String = unsafe {
    String::from_utf8_unchecked(b"\xc8\x01".to_vec())
};

// "�"
println!("Encoding string read from byte array: {:?}", string_from_byte_array);

let bytes: &[u8] = string_from_byte_array.as_bytes();
// [200, 1] (correct)
println!("Bytes obtained from the encoding when read as a byte array: {:?}", bytes);

The issue here is that when reading from the YAML file, the encodings (Mapping keys) get automatically interpreted as UTF-8 strings, so the original bytes are lost:

use serde::Deserialize;
use serde_yaml::{Deserializer, Value};

let f = std::fs::read(yaml_dir).expect("Unable to read file");
    
for doc in Deserializer::from_slice(&f) {
    let spec = Value::deserialize(doc).expect("Unable to parse document");

    // Mapping {..., "examples": Mapping {"\0": Number(0), "\u{1}": Number(1), "\n": Number(10), "È\u{1}": Number(200), "è\u{7}": Number(1000), "©F": Number(9001), "ÿÿÿÿÿÿÿÿÿ\u{1}": Number(-1)}}
    println!("YAML spec interpreted: {:?}", spec);
}

A more specific example using serde_yaml:

// Sequence [Number(200), Number(1)] (Correct, but how to make the YAML get interpreted like this?)
let bytes = serde_yaml::to_value(b"\xc8\x01").unwrap();

// String("È\u{1}") (Incorrect)
let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();

I'm using serde_yaml but any other approach would be acceptable. How can I make it so that the encodings in the YAML, exactly as they are written, are correctly interpreted as byte arrays instead of strings?

I know serde_yaml has methods such as deserialize_bytes, but I'm not sure how to apply them in this case.

Alternatively, is there a way to continue reading the encodings normally as UTF-8 strings and then extract the original non-UTF-8 bytes from them?

  • Note that a String holding bytes that are not valid UTF-8 is UB, which suggests that what you want is Vec<u8>, not String. Commented Dec 31, 2022 at 16:38
  • You're right. Unsafe behavior would be acceptable for this use case, but obtaining a Vec<u8> would be better. Commented Dec 31, 2022 at 16:47
  • È\u{1} is correct, that's \u00c8\u0001. If you want to represent a number using Unicode code points, you have to write a parser yourself; I do not understand the expectation that the parser will convert it for you. "[195, 136, 1] (Incorrect)": that is the correct UTF-8 for È\u{1}. Do you want UTF-32? That's not bytes, that's words. Commented Dec 31, 2022 at 16:47
  • @NivaldoT UB is never acceptable. Maybe you don't exactly know what it means (in Rust): if you ever reach UB, it means the compiler makes no promises about what might happen. It doesn't mean "now, if you make a call to another function, there might be a problem", it means "Rust has the right to have produced code that instantly blows up your computer". That's likely not what will happen in this case, but it is never a good thing to reach UB. Commented Dec 31, 2022 at 16:53
  • @jthulhu Is there a way to get serde_yaml to interpret map keys as anything but String? I've tried bstr and even a custom deserializer, and I can't see a way of getting around examples: invalid type: string \"\\0\", expected a sequence. Commented Dec 31, 2022 at 17:22

2 Answers

\u00c8 is the escape for the character È (code point U+00C8, which is also its single UTF-16 code unit). That's not the byte 200. You have written the character È, not 200.

195, 136 (0xC3 0x88) is the UTF-8 encoding of È. This is how the character È is represented as bytes in a Rust string.
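To make the distinction concrete, here is a minimal sketch using only the standard library. The helper names `code_point` and `utf8_bytes` are made up for this illustration and are not part of serde_yaml or any other crate.

```rust
/// Hypothetical helper: the Unicode code point of a character as a number.
fn code_point(c: char) -> u32 {
    c as u32
}

/// Hypothetical helper: the bytes Rust stores for a character inside a str/String.
fn utf8_bytes(c: char) -> Vec<u8> {
    c.to_string().into_bytes()
}

fn main() {
    let c = '\u{00c8}'; // È
    // The code point is 200 (0xC8)...
    println!("code point: {}", code_point(c));
    // ...but its UTF-8 encoding is the two bytes 0xC3 0x88.
    println!("UTF-8 bytes: {:?}", utf8_bytes(c));
}
```

The number 200 only shows up when you look at the code point, not at the bytes of the string.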

If you want the UTF-16 code units of a character, you want to print u16 values, not u8. Try:

fn main() {
    let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();
    let v: Vec<u16> = st.as_str().unwrap().encode_utf16().collect();
    println!("{} {}", v[0], v[1]);
}
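If the goal is a Vec<u8> rather than u16 values, one possible approach is to keep reading the keys as ordinary strings and map each character back to a single byte. This is only a sketch under the assumption that every escape in the YAML keys is meant as one byte, i.e. every code point is at most 0xFF; the function name `chars_to_bytes` is hypothetical.

```rust
/// Hypothetical helper: turn each char of a decoded YAML key into one byte.
/// Assumes every char was written as a single-byte escape (code point <= 0xFF)
/// and returns None if any char falls outside that range.
fn chars_to_bytes(key: &str) -> Option<Vec<u8>> {
    key.chars()
        .map(|c| if (c as u32) <= 0xFF { Some(c as u8) } else { None })
        .collect()
}

fn main() {
    // "\u00c8\u0001" in the YAML arrives in Rust as the string "È\u{1}".
    let key = "\u{00c8}\u{0001}";
    // UTF-8 bytes of the string: [195, 136, 1]
    println!("{:?}", key.as_bytes());
    // One byte per code point: Some([200, 1])
    println!("{:?}", chars_to_bytes(key));
}
```

No unsafe code is needed, and a key containing a character above U+00FF is reported as None instead of being silently truncated.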

A superficial reading of serde_yaml's code suggests that it will always try to convert your YAML string keys to str (which must fail since they're not valid UTF-8), so you can't get a [u8] out of them. I suggest you change your YAML:

examples:
  [0]: 0
  [0x01]: 1
  [0x0A]: 10
  [0xc8, 0x01]: 200
  [0xe8, 0x07]: 1000
  [0xa9, 0x46]: 9001
  [0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01]: -1

This can be parsed by serde_yaml, but alas, you said you don't want to do that.
