I am using Ruby 2.3:

I have the following string: "\xFF\xFE"

I read a file containing it with File.binread(), so the resulting string has encoding ASCII-8BIT. In my code, I check whether this string was indeed read by comparing it to the literal string "\xFF\xFE" (which has encoding UTF-8, the default for Ruby string literals).

However, the comparison returns false, even though both strings contain the same bytes. The only difference is that one has encoding ASCII-8BIT and the other UTF-8.

I have two questions: (1) why does it return false? and (2) what is the best way to go about achieving what I want? I just want to check whether the string I read matches "\xFF\xFE".

  • If you just want to read a file with a Unicode BOM, you can pass an encoding of 'BOM|UTF-8' and let Ruby handle it automatically. Commented Oct 18, 2017 at 6:33

1 Answer

(1) why does it return false?

Two strings compare equal only if they contain the same bytes and their encodings are compatible: either both strings are in the same encoding, or both contain only characters in the US-ASCII range.

Comparison works as expected if the strings only contain byte values 0 to 127 (0b0xxxxxxx):

a = 'E'.encode('ISO8859-1')  #=> "E"
b = 'E'.encode('ISO8859-15') #=> "E"

a.bytes #=> [69]
b.bytes #=> [69]
a == b  #=> true

And it fails if they contain any byte values 128 to 255 (0b1xxxxxxx):

a = 'É'.encode('ISO8859-1')  #=> "\xC9"
b = 'É'.encode('ISO8859-15') #=> "\xC9"

a.bytes #=> [201]
b.bytes #=> [201]
a == b  #=> false

Your string can't be represented in US-ASCII, because both of its bytes are outside the US-ASCII range (0 to 127):

"\xFF\xFE".bytes #=> [255, 254]

Attempting to convert it doesn't produce any meaningful result:

"\xFF\xFE".encode('US-ASCII', 'ASCII-8BIT', :undef => :replace)
#=> "??"

Comparing it to a string in a different encoding will therefore always return false, even when the bytes are identical.
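
The rule above can be checked directly. This is a minimal sketch, not part of the original answer: the variable names are made up for illustration, and force_encoding is used (an assumption on my part) so the result does not depend on the encoding of the source file.

```ruby
# Same two bytes, tagged with different encodings.
binary = "\xFF\xFE".b                                   # ASCII-8BIT
utf8   = "\xFF\xFE".dup.force_encoding(Encoding::UTF_8) # UTF-8 (these bytes are not valid UTF-8)

bytes_match = binary.bytes == utf8.bytes         # true: the bytes are identical
compatible  = Encoding.compatible?(binary, utf8) # nil: the two strings share no compatible encoding
equal       = binary == utf8                     # false

# With ASCII-only content the encodings are compatible and == succeeds.
ascii_binary = "BOM".b
ascii_utf8   = "BOM".dup.force_encoding(Encoding::UTF_8)
ascii_equal  = ascii_binary == ascii_utf8        # true
```

Encoding.compatible? returning nil is a quick way to tell that == cannot succeed for a given pair of strings, whatever their bytes are.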

(2) what is the best way to go about achieving what i want?

You could compare your string to a string with the same encoding. binread returns a string in ASCII-8BIT encoding, so you could use String#b to create a compatible one:

IO.binread('your_file', 2) == "\xFF\xFE".b

or you could compare its bytes:

IO.binread('your_file', 2).bytes == [0xFF, 0xFE]
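
Either comparison can be wrapped in a small helper. The following is only a sketch: utf16le_bom? is a hypothetical name, and the temporary file exists purely to make the example self-contained.

```ruby
require 'tempfile'

# Hypothetical helper: true if the file starts with the bytes FF FE.
# The expected bytes are converted with String#b so that both sides of
# the comparison are ASCII-8BIT.
def utf16le_bom?(path, bom = "\xFF\xFE".b)
  # binread returns nil for an empty file, and nil == bom is simply false.
  IO.binread(path, bom.bytesize) == bom
end

# Usage: write a small file that starts with the two bytes, then check it.
Tempfile.create('bom_demo') do |f|
  f.binmode
  f.write("\xFF\xFEh\x00i\x00".b)
  f.flush
  puts utf16le_bom?(f.path) # prints "true"
end
```

Reading only bom.bytesize bytes keeps the check cheap even for large files.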

Comments

But it's 8 bit ascii and the characters 255 and 254 are defined. So what's up with that? "\xFF\xFE".encode('ASCII-8BIT') works just fine. Is it because it's not valid UTF-8 ?
There is no such thing as "8 bit ascii". ASCII is, has always been, and will always be 7 bit.
@banister you are confusing 'ASCII-8BIT' with 'US-ASCII' and my answer wasn't very precise in that regard either. I've updated it accordingly.
@JörgWMittag ok then "extended ascii is 8 bit" is that ok? :) From wikipedia, "The extended ASCII character set uses 8 bits"
@horseyguy Wikipedia says that? "extended ASCII" is an umbrella term referring to any encoding built on top of ASCII. Even UTF-8 is extended ASCII.