Why is a UTF-8 string not equal to the equivalent ASCII-8BIT string in Ruby 2.0?

Question

I am using Ruby 2.3.

I have the following string: "\xFF\xFE"

I read it from a file with File.binread(), so the encoding of the resulting string is ASCII-8BIT. In my code, I check whether the string was indeed read by comparing it to the literal string "\xFF\xFE" (which has encoding UTF-8, the default for string literals in Ruby).

However, the comparison returns false, even though both strings contain the same bytes. The only difference is that one has encoding ASCII-8BIT and the other UTF-8.

I have two questions: (1) why does it return false? and (2) what is the best way to achieve what I want? I just want to check whether the string I read matches "\xFF\xFE".

If you just want to read a file with a Unicode BOM, you can pass an encoding of 'BOM|UTF-8' and let Ruby handle it automatically.
– Stefan, commented Oct 18, 2017 at 6:33
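
A minimal sketch of what this comment suggests, assuming a file that starts with a UTF-8 BOM (bytes EF BB BF). The temp-file prefix and the contents are made up for illustration. The file in the question starts with FF FE instead, which this sketch does not cover.

```ruby
require 'tempfile'

content = nil

Tempfile.create('bom_utf8_demo') do |f|   # hypothetical file name prefix
  f.binmode
  f.write("\xEF\xBB\xBFhello".b)          # UTF-8 BOM followed by ASCII text
  f.flush

  # 'r:BOM|UTF-8' asks Ruby to skip a leading BOM, if present, and read as UTF-8.
  content = File.open(f.path, 'r:BOM|UTF-8') { |io| io.read }
end

p content           # the BOM is not part of the string
p content.encoding
```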

Stefan · Accepted Answer · 2017-10-18 09:39:47Z

(1) why does it return false?

Two strings can only compare as equal if they have the same encoding, or if both contain only 7-bit ASCII characters (byte values 0 to 127) in ASCII-compatible encodings.

Comparison works as expected if both strings contain only byte values 0 to 127 (0b0xxxxxxx):

a = 'E'.encode('ISO8859-1')  #=> "E"
b = 'E'.encode('ISO8859-15') #=> "E"

a.bytes #=> [69]
b.bytes #=> [69]
a == b  #=> true

It fails if the strings contain any byte values 128 to 255 (0b1xxxxxxx), even when the bytes are identical:

a = 'É'.encode('ISO8859-1')  #=> "\xC9"
b = 'É'.encode('ISO8859-15') #=> "\xC9"

a.bytes #=> [201]
b.bytes #=> [201]
a == b  #=> false

Your string can't be represented in US-ASCII, because both of its bytes are outside the 7-bit range:

"\xFF\xFE".bytes #=> [255, 254]

Attempting to convert it doesn't produce any meaningful result:

"\xFF\xFE".encode('US-ASCII', 'ASCII-8BIT', :undef => :replace)
#=> "??"

Comparing it with a string in a different encoding will therefore return false, even when both strings contain exactly the same bytes.
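
Applied to the string from the question, this can be sketched as follows. The variable names are made up for illustration, and the UTF-8 string is tagged explicitly so that the sketch does not depend on the source file's encoding.

```ruby
# Same two bytes, tagged with two different encodings.
utf8   = "\xFF\xFE".dup.force_encoding(Encoding::UTF_8)
binary = "\xFF\xFE".b   # copy tagged ASCII-8BIT (Encoding::BINARY)

same_bytes  = utf8.bytes == binary.bytes   # both are [255, 254]
equal       = utf8 == binary               # encodings differ, bytes are not ASCII
ascii_only  = utf8.ascii_only?
# Re-tagging a copy as binary makes the encodings match again.
retagged_eq = utf8.dup.force_encoding(Encoding::BINARY) == binary

p same_bytes, equal, ascii_only, retagged_eq
# prints true, false, false, true (one per line)
```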

(2) what is the best way to go about achieving what i want?

You could compare your string to a string with the same encoding. binread returns a string in ASCII-8BIT encoding, so you could use String#b to create a literal with a matching encoding:

IO.binread('your_file', 2) == "\xFF\xFE".b

or you could compare its bytes:

IO.binread('your_file', 2).bytes == [0xFF, 0xFE]
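
For a self-contained way to try this out, here is a small sketch that writes two temporary files and checks each one. The helper name starts_with_ff_fe? and the temp-file prefixes are hypothetical and not part of Ruby.

```ruby
require 'tempfile'

# Hypothetical helper: true if the file at `path` starts with the bytes FF FE.
def starts_with_ff_fe?(path)
  # binread returns an ASCII-8BIT string, so compare against
  # an ASCII-8BIT copy of the literal.
  File.binread(path, 2) == "\xFF\xFE".b
end

with_bom = without_bom = nil

Tempfile.create('bom_demo') do |f|
  f.binmode
  f.write("\xFF\xFEh\x00i\x00".b)   # FF FE followed by "hi" in UTF-16LE
  f.flush
  with_bom = starts_with_ff_fe?(f.path)
end

Tempfile.create('plain_demo') do |f|
  f.binmode
  f.write('hi')                     # no leading FF FE
  f.flush
  without_bom = starts_with_ff_fe?(f.path)
end

p with_bom, without_bom
# prints true, then false
```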

edited Oct 18, 2017 at 9:39

answered Oct 18, 2017 at 6:31

Stefan

7 Comments

horseyguy Over a year ago

But it's 8 bit ascii and the characters 255 and 254 are defined. So what's up with that? "\xFF\xFE".encode('ASCII-8BIT') works just fine. Is it because it's not valid UTF-8 ?

Jörg W Mittag Over a year ago

There is no such thing as "8 bit ascii". ASCII is, has always been, and will always be 7 bit.

Stefan Over a year ago

@banister you are confusing 'ASCII-8BIT' with 'US-ASCII' and my answer wasn't very precise in that regard either. I've updated it accordingly.

horseyguy Over a year ago

@JörgWMittag ok then "extended ascii is 8 bit" is that ok? :) From wikipedia, "The extended ASCII character set uses 8 bits"

Stefan Over a year ago

@horseyguy Wikipedia says that? "extended ASCII" is an umbrella term referring to any encoding built on top of ASCII. Even UTF-8 is extended ASCII.
