UTF-8 ruby encoding

Question

I've got this string: WinterIDäSchwiiz, which comes from an API, and I want to search for it in the database. It turns out that this string has a different encoding than the one saved in my database, yet Ruby says the encoding for both is UTF-8. What is going on?

I've figured out the most terrible way to fix this problem: going down to the byte sequence, replacing the bytes representing the "ä" with a different byte sequence, and then calling force_encoding to make it UTF-8 again. It works but hurts my eyes. Does anyone have a better solution than:

 "WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i}.pack('C*').force_encoding('utf-8')

There is not enough information in your question. What is the correct encoding of the string, if not UTF-8? What encoding is reported by the API's HTTP headers? What encoding is set on the database table? What code is responsible for retrieving the data from the API? What code inserts it into the database? What code retrieves it from the database?
– Jordan Running, Commented Jan 13, 2016 at 21:29

Jordan Running · Accepted Answer · 2016-01-14 01:38:04Z

Your string is UTF-8.

I can tell because your fix is to replace the bytes (97, 204, 136) with the bytes (195, 164).

The first byte you're replacing, 97 (0x61), is the UTF-8 encoding of the character a. The next two bytes, 204 and 136 (0xCC 0x88), are the UTF-8 encoding of U+0308, the combining diaeresis: ̈. The two characters combine to form ä.

The bytes you're expecting are 195 and 164 (0xC3 0xA4) which, together, are U+00E4, or Latin small letter "a" with diaeresis.

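Both are valid UTF-8, and both print as ä: the first as two code points (a followed by a combining diaeresis), the second as a single precomposed code point. This is an example of Unicode equivalence.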

In other words:

str1 = "a\xCC\x88"
puts str1 # => ä
p str1.bytes # => [97, 204, 136]
p str1.encoding # => #<Encoding:UTF-8>

str2 = "\xC3\xA4"
puts str2 # => ä
p str2.bytes # => [195, 164]
p str2.encoding # => #<Encoding:UTF-8>
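
If the byte values are hard to read, it can help to compare the strings at the code-point level instead. The following is a minimal sketch; describe_codepoints is a hypothetical helper written for this illustration, not part of Ruby, and the \u escapes are used so the example does not depend on how the source file was saved.

```ruby
# Hypothetical helper (not part of Ruby): list each code point as U+XXXX.
def describe_codepoints(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }
end

decomposed  = "a\u0308" # "a" followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00E4"  # single precomposed code point

p describe_codepoints(decomposed)  # => ["U+0061", "U+0308"]
p describe_codepoints(precomposed) # => ["U+00E4"]
p decomposed.length                # => 2
p precomposed.length               # => 1
p decomposed == precomposed        # => false
```

Two code points versus one is why the strings compare as different even though they look the same on screen.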

Fortunately, we have Unicode normalization to help deal with this. This is a big topic, but the very, very insufficient TL;DR is that the Unicode Consortium has prescribed standard ways to normalize strings like the above, i.e. how to turn str1 into str2 (composition, as in the NFC form) or str2 into str1 (decomposition, as in the NFD form).

Unfortunately, it's impossible to say what the best solution for you is, since you didn't provide any details. Your database might have built-in normalization functionality, but I don't know what database you're using so I can't say. Since you did mention Ruby I can point you to the String#unicode_normalize method, which was introduced in Ruby's standard library in Ruby 2.2:

str1 = "a\xCC\x88"
str2 = "\xC3\xA4"
p str1 == str2 # => false

str1_normalized = str1.unicode_normalize

p str1_normalized == str2
# => true
p str1_normalized.bytes == str2.bytes
# => true
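
Called without arguments, unicode_normalize uses the :nfc form (composition), which is why str1 becomes equal to str2 above. The sketch below shows the opposite direction with :nfd and the companion predicate unicode_normalized?; the variable names are only illustrative.

```ruby
decomposed  = "a\u0308" # same content as str1 above
precomposed = "\u00E4"  # same content as str2 above

# The default form is :nfc, so these two calls give the same result.
p decomposed.unicode_normalize == decomposed.unicode_normalize(:nfc) # => true

# :nfd decomposes: the precomposed character becomes "a" + combining diaeresis.
p precomposed.unicode_normalize(:nfd) == decomposed # => true

# unicode_normalized? reports whether a string is already in a form (default :nfc).
p decomposed.unicode_normalized?  # => false
p precomposed.unicode_normalized? # => true
```

Which form to pick matters less than using the same form everywhere you compare or store strings.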

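If you don't have Ruby 2.2+, well... upgrade. But if you can't upgrade for some reason you can use ActiveSupport::Multibyte::Unicode.normalize, which is especially convenient if you're using Rails, or the Unicode gem. (Later Rails versions deprecated that method in favor of String#unicode_normalize, so it only makes sense on old setups.)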
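
For the original problem, searching the database for a value that came from the API, one possible approach is to normalize the incoming string before the lookup. This is only a sketch under assumptions: it assumes the stored values are in NFC form, it uses a plain array as a stand-in for the database, and find_record is a hypothetical helper, not part of any library.

```ruby
# Hypothetical helper: normalize the incoming value to NFC, then look it up.
# stored_names stands in for the database and is assumed to hold NFC strings.
def find_record(stored_names, name)
  normalized = name.unicode_normalize(:nfc)
  stored_names.find { |stored| stored == normalized }
end

stored_names = ["WinterID\u00E4Schwiiz"] # precomposed ä, as in the database
from_api     = "WinterIDa\u0308Schwiiz"  # decomposed ä, as returned by the API

p stored_names.include?(from_api)                # => false
p find_record(stored_names, from_api).nil?       # => false
p find_record(stored_names, "no such name").nil? # => true
```

If the database stores a different form, or mixes forms, normalizing on write as well as on read is the safer design.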

One more thing

You don't need to do this, since the above is the correct way to do Unicode normalization in Ruby, but a much easier way to write this:

"WinterIDäSchwiiz".bytes.join(",").gsub("97,204,136","195,164").split(",").collect{|s| s.to_i }.pack('C*').force_encoding('utf-8')

...would have been this:

"WinterIDäSchwiiz".gsub("a\xCC\x88", "\xC3\xA4")

Any time you see something like join(",")...split(",") in Ruby it's almost certainly the wrong solution.
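
To illustrate why normalization is still preferable to the shorter gsub, here is a small sketch. The input string is made up for the example: it extends the original value with a second decomposed character (ü) that a replacement targeting only ä would miss.

```ruby
# Illustrative input: decomposed ä (a + U+0308) and decomposed ü (u + U+0308).
api_value = "WinterIDa\u0308Schwiiz Zu\u0308rich"

patched    = api_value.gsub("a\u0308", "\u00E4") # fixes only the ä
normalized = api_value.unicode_normalize(:nfc)   # composes every such pair

p patched.unicode_normalized?    # => false (the decomposed ü is still there)
p normalized.unicode_normalized? # => true
p normalized == "WinterID\u00E4Schwiiz Z\u00FCrich" # => true
```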

That is an excellent answer! Thanks a lot, I've learned a lot!
