8

In IRB, I'm trying the following:

1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
 => "\xBF" 
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'

Any ideas what's going wrong?

3 Answers

23

I'd guess that "\xBF" is already tagged as UTF-8, so when you call encode, Ruby sees a request to convert a UTF-8 string to UTF-8 and does nothing:

>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>

\xBF isn't valid UTF-8, so this is, of course, nonsense. However, encode also has a three-argument form:

encode(dst_encoding, src_encoding [, options] ) → str

[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.

You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:

>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"

Where s is the "\xBF" that thinks it is UTF-8 from above.

You could also use force_encoding on s to force it to be binary and then use the two-argument encode:

>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
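As a standalone script, the diagnosis and the fix above might look like the following minimal sketch. It assumes a UTF-8 source file; the variable names s and fixed are only for illustration, and valid_encoding? is used here as a quick check that the answer itself does not mention.

```ruby
# encoding: utf-8

# A lone 0xBF byte, explicitly tagged as UTF-8 (dup avoids touching a frozen literal).
s = "\xBF".dup.force_encoding('UTF-8')

puts s.encoding          # UTF-8
puts s.valid_encoding?   # false: the tag says UTF-8, the bytes disagree

# Three-argument form: ignore the UTF-8 tag, treat the bytes as binary,
# and replace anything that has no UTF-8 equivalent.
fixed = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)

puts fixed.valid_encoding?   # true
p fixed.match(/foo/)         # nil, and no ArgumentError
```

Checking valid_encoding? before matching is one way to find out whether a string needs this treatment at all.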

1 Comment

@drewinglis: I like the explicitness of "binary" (which is an alias for "ascii-8bit"), "ascii" isn't exactly the same.
5

If you're only working with ASCII characters, you can use:

>> "Hello \xBF World!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "Hello � World!"

But what happens if we use the same approach with valid UTF-8 characters that are outside ASCII?

>> "¡Hace \xBF mucho frío!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "��Hace � mucho fr��o!"

Uh oh! We want frío to keep its accent. Here's an option that keeps the valid UTF-8 characters and drops the invalid bytes:

>> "¡Hace \xBF mucho frío!".chars.select{|i| i.valid_encoding?}.join
=> "¡Hace  mucho frío!"

Also, Ruby 2.1 added a new method called scrub that solves this problem:

>> "¡Hace \xBF mucho frío!".scrub
=> "¡Hace � mucho frío!"
>> "¡Hace \xBF mucho frío!".scrub('')
=> "¡Hace  mucho frío!"
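If the same code has to run on Rubies both with and without scrub, the two approaches above could be combined roughly as follows. This is only a sketch: clean_utf8 is a hypothetical helper name, not something from the standard library, and it assumes the input is already tagged as UTF-8.

```ruby
# encoding: utf-8

# Hypothetical helper: use String#scrub where the running Ruby has it,
# otherwise fall back to checking each character individually.
def clean_utf8(str, replacement = '')
  if str.respond_to?(:scrub)
    str.scrub(replacement)
  else
    str.chars.map { |c| c.valid_encoding? ? c : replacement }.join
  end
end

text = "¡Hace \xBF mucho frío!"

puts clean_utf8(text)        # invalid byte dropped, accents kept
puts clean_utf8(text, '?')   # invalid byte replaced with "?"
```

Either branch leaves the valid multibyte characters alone, which is the difference from transcoding out of 'binary'.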


2

This is fixed if you read the source text file using an explicit encoding:

File.open( 'thefile.txt', 'r:iso8859-1' )
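A slightly fuller sketch of that idea, assuming the file really is ISO-8859-1: the mode string can also name an internal encoding, so the data is transcoded to UTF-8 as it is read. The temporary file and its contents are made up for the demonstration.

```ruby
# encoding: utf-8
require 'tempfile'

# Create a scratch file containing the single raw byte 0xBF.
file = Tempfile.new('latin1-demo')
file.binmode
file.write([0xBF].pack('C*'))
file.close

# 'r:iso-8859-1:utf-8' means: the bytes on disk are ISO-8859-1,
# hand them to the program as UTF-8.
text = File.open(file.path, 'r:iso-8859-1:utf-8') { |f| f.read }

puts text.encoding          # UTF-8
puts text.valid_encoding?   # true
p text.match(/foo/)         # nil, no ArgumentError

file.unlink
```

This only helps when the declared external encoding matches what is actually in the file; a wrong guess will read without errors but produce the wrong characters.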
