I'm running into some strange, inconsistent behaviour in the way that Ruby (v2.5.3) handles encoded strings compared with its YAML parser. Here's an example:
"\x80" # Returns "\x80"
"\x80".bytesize # Returns 1
"\x80".bytes # Returns [128]
"\x80".encoding # Returns UTF-8
YAML.load('{value: "\x80"}')["value"] # Returns "\u0080"
YAML.load('{value: "\x80"}')["value"].bytesize # Returns 2
YAML.load('{value: "\x80"}')["value"].bytes # Returns [194, 128]
YAML.load('{value: "\x80"}')["value"].encoding # Returns UTF-8
My understanding of UTF-8 is that any code point above U+007F must be encoded as multiple bytes (two bytes for anything up to U+07FF), so a lone 0x80 byte should never appear on its own. So my questions are the following:
- Is the one-byte string "\x80" valid UTF-8?
- If so, why does YAML convert it into a two-byte pattern?
- If not, why does Ruby claim the encoding is UTF-8 when the string contains an invalid byte sequence?
- Is there a way to make the YAML parser and the Ruby string behave in the same way as each other? (The closest I've found is sketched below.)
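For reference, that sketch: YAML's !binary tag carries raw bytes as base64, and Psych (if I understand its handling of !binary correctly) loads it as an ASCII-8BIT string, bypassing the escape handling entirely; "gA==" is base64 for the single byte 0x80:

require 'yaml'
require 'base64'

Base64.strict_encode64("\x80")                         # Returns "gA=="
YAML.load('{value: !binary "gA=="}')["value"]          # Returns "\x80"
YAML.load('{value: !binary "gA=="}')["value"].bytes    # Returns [128]
YAML.load('{value: !binary "gA=="}')["value"].encoding # Returns ASCII-8BIT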
"\x80".valid_encoding?definitely invalid. Not sure what YAML is doing though