I'm running into some strange behaviour and inconsistency in the way that Ruby (v2.5.3) deals with encoded strings versus the YAML parser. Here's an example:

"\x80"          # Returns "\x80"
"\x80".bytesize # Returns 1
"\x80".bytes    # Returns [128]
"\x80".encoding # Returns UTF-8

require "yaml"
YAML.load('{value: "\x80"}')["value"]          # Returns "\u0080"
YAML.load('{value: "\x80"}')["value"].bytesize # Returns 2
YAML.load('{value: "\x80"}')["value"].bytes    # Returns [194, 128]
YAML.load('{value: "\x80"}')["value"].encoding # Returns UTF-8

My understanding of UTF-8 is that any single-byte value above 0x7F should be encoded into two bytes. So my questions are the following:

  1. Is the one byte string "\x80" valid UTF-8?
  2. If so, why does YAML convert it into a two-byte pattern?
  3. If not, why does Ruby claim the encoding is UTF-8 when the string contains an invalid byte sequence?
  4. Is there a way to make the YAML parser and the Ruby string behave in the same way as each other?
  • "\x80".valid_encoding? returns false, so it is definitely invalid. Not sure what YAML is doing, though. Commented Mar 14, 2019 at 21:06
  • Excellent point -- I assumed it would throw an error or something when you tried to set it, but I guess it just allows the string and says it's invalid if you bother to check. Commented Mar 14, 2019 at 21:18

1 Answer

No, the one-byte string "\x80" is not valid UTF-8:

"\x80".valid_encoding?
# false

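Ruby reports UTF-8 because a string literal takes the encoding of the source file, which defaults to UTF-8 (since Ruby 2.0). Ruby does not validate the bytes when the literal is created, so the string is tagged as UTF-8 even though its contents are invalid.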
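
To see the difference concretely, here is a minimal sketch that puts the two strings side by side. It assumes the standard yaml library (Psych) and relies on the fact that a single-quoted Ruby string passes the backslash through untouched, so it is the YAML parser, not Ruby, that interprets \x80. In a YAML double-quoted scalar that escape denotes the code point U+0080 rather than a raw byte, which is why the parsed value is two bytes long. The method names ruby_literal and yaml_value are only illustrative.

```ruby
require "yaml"

# Double-quoted Ruby literal: Ruby itself turns \x80 into the raw byte 0x80
# and tags the result with the source encoding (UTF-8 by default).
def ruby_literal
  "\x80"
end

# Single-quoted Ruby literal: the backslash is kept, so the YAML parser
# receives the characters \ x 8 0 and decodes the escape as U+0080.
def yaml_value
  YAML.load('{value: "\x80"}')["value"]
end

p ruby_literal.bytes            # [128]
p ruby_literal.valid_encoding?  # false
p yaml_value.bytes              # [194, 128]
p yaml_value.valid_encoding?    # true
p yaml_value == "\u0080"        # true
```

If the goal is simply for both strings to match, writing the Ruby literal as "\u0080" gives the same valid two-byte string that YAML returns.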

I don't think you can force the YAML parser to return invalid UTF-8. But to get Ruby to convert that byte into the corresponding UTF-8 character, you can do this:

"\x80".b.ord.chr('utf-8')
# "\u0080"

String#b is only available in Ruby 2.0 and later. On older versions, use force_encoding instead.
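
As a rough sketch of the force_encoding route, the two helpers below are hypothetical names, not part of any library. The first mirrors the .b.ord.chr approach for a single byte. The second is an alternative that assumes each byte should map to the code point with the same number, which is exactly what ISO-8859-1 defines, so it also handles strings longer than one byte. Both call dup first so the caller's string is left untouched.

```ruby
# Hypothetical helper: reinterpret the first raw byte of str as the
# code point with the same number, returned as a UTF-8 string.
def byte_to_utf8_char(str)
  str.dup.force_encoding(Encoding::BINARY).ord.chr(Encoding::UTF_8)
end

# Hypothetical helper: treat every byte as an ISO-8859-1 character
# and transcode the whole string to UTF-8.
def latin1_bytes_to_utf8(str)
  str.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

p byte_to_utf8_char("\x80")            # "\u0080"
p byte_to_utf8_char("\x80").bytes      # [194, 128]
p latin1_bytes_to_utf8("\x80").bytes   # [194, 128]
```

Whether the ISO-8859-1 interpretation is appropriate depends on where the bytes came from; if they were produced in some other single-byte encoding, that encoding's name would need to be used instead.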
