I have a YAML file with test cases for encoding and decoding elements. The left-hand side represents the expected encoded bytes, and the right-hand side contains the original element. For example, the VarInt test cases are:
examples:
"\0": 0
"\u0001": 1
"\u000A": 10
"\u00c8\u0001": 200
"\u00e8\u0007": 1000
"\u00a9\u0046": 9001
"\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u0001": -1
The encodings (left-hand side) for the first three examples work correctly when read as strings (which are automatically interpreted as UTF-8 in Rust).
However, the fourth example (200) and the subsequent ones don't yield the correct results. Using the encoding for 200 ("\u00c8\u0001") as an example:
Reading the encoding as a UTF-8 string (incorrect):
use bytes::{Buf, BufMut};
let encoding_as_utf8_string = "\u{00c8}\u{0001}";
// "È\u{1}"
println!("Encoding as a UTF-8 string: {:?}", encoding_as_utf8_string);
let mut utf8_bytes: &[u8] = encoding_as_utf8_string.as_bytes();
// [195, 136, 1] (Incorrect)
println!("Bytes obtained from the encoding when read as a UTF-8 string: {:?}", utf8_bytes);
Reading the encoding as a byte array (correct):
use bytes::{Buf, BufMut};
let encoding_as_byte_array: &[u8; 2] = b"\xc8\x01";
// [200, 1] is not valid UTF-8, so building a String from it with
// String::from_utf8_unchecked is undefined behavior. A lossy conversion
// is used here for display only.
let string_from_byte_array = String::from_utf8_lossy(encoding_as_byte_array);
// "�\u{1}"
println!("Encoding string read from byte array: {:?}", string_from_byte_array);
let mut bytes: &[u8] = encoding_as_byte_array;
// [200, 1] (correct)
println!("Bytes obtained from the encoding when read as a byte array: {:?}", bytes);
The issue here is that when reading from the YAML file, the encodings (Mapping keys) get automatically interpreted as UTF-8 strings, so the original bytes are lost:
use serde::Deserialize;
use serde_yaml::{Deserializer, Value};
let f = std::fs::read(yaml_dir).expect("Unable to read file");
for doc in Deserializer::from_slice(&f) {
let spec = Value::deserialize(doc).expect("Unable to parse document");
// Mapping {..., "examples": Mapping {"\0": Number(0), "\u{1}": Number(1), "\n": Number(10), "È\u{1}": Number(200), "è\u{7}": Number(1000), "©F": Number(9001), "ÿÿÿÿÿÿÿÿÿ\u{1}": Number(-1)}}
println!("YAML spec interpreted: {:?}", spec);
}
A more specific example using serde_yaml:
// Sequence [Number(200), Number(1)] (Correct, but how to make the YAML get interpreted like this?)
let bytes = serde_yaml::to_value(b"\xc8\x01").unwrap();
// String("È\u{1}") (Incorrect)
let st = serde_yaml::to_value("\u{00c8}\u{0001}").unwrap();
I'm using serde_yaml but any other approach would be acceptable. How can I make it so that the encodings in the YAML, exactly as they are written, are correctly interpreted as byte arrays instead of strings?
I know serde_yaml has methods such as deserialize_bytes, but I'm not sure how to apply them in this case.
Alternatively, is there a way to continue reading the encodings normally as UTF-8 strings and then extract the original non-UTF-8 bytes from them?
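One possible sketch of that second approach, assuming every escape in the YAML keys is in the range \u0000 to \u00ff (true for all the examples above): each decoded character then corresponds to exactly one original byte, so the code point itself can be used as the byte value instead of the character's UTF-8 encoding. The helper name latin1_bytes is hypothetical and not part of serde_yaml.

```rust
/// Hypothetical helper (not part of serde_yaml): maps each char of an
/// already-decoded YAML key back to a single byte.
/// Assumption: every escape in the key was in U+0000..=U+00FF.
/// Returns None if any char falls outside that range.
fn latin1_bytes(key: &str) -> Option<Vec<u8>> {
    key.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                // The code point is the byte value, not its UTF-8 encoding.
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}
```

With this, latin1_bytes("\u{00c8}\u{0001}") gives Some(vec![200, 1]) rather than the three UTF-8 bytes, so the keys can still be read as ordinary strings and converted afterwards.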
Vec<u8>, not String?
È\u{1} is correct, that's \u00c8\u0001. If you want to represent a number using Unicode code points, you have to write a parser yourself; I do not understand the expectation that the parser would convert it for you.
[195, 136, 1] (Incorrect): that's correct UTF-8 for È\u1. You want UTF-32? That's not bytes, that's words.
serde_yaml to interpret map keys as anything but String? I've tried bstr and even a custom deserializer, and I can't see a way of getting around examples: invalid type: string \"\\0\", expected a sequence.