Ruby, problems comparing strings with UTF-8 characters

Question

I have these 2 UTF-8 strings:

a = "N\u01b0\u0303"
b = "N\u1eef"

They look quite different in their escaped form, but they are the same once they are rendered:

irb(main):039:0> puts "#{a} - #{b}"
Nữ - Nữ

The a version is the one I have stored in the DB. The b version is the one coming from the browser in a POST request. I don't know why the browser is sending a different combination of UTF-8 characters, and it doesn't always happen: I can't reproduce the issue in my dev environment, but it happens in production for a percentage of the total requests.

When I compare the two strings, the result is false:

irb(main):035:0> a == b
=> false

I've tried different things like forcing encoding:

irb(main):022:0> b.force_encoding("UTF-8") == a.force_encoding("UTF-8")
=> false

Another interesting fact is:

irb(main):005:0> a.chars
=> ["N", "ư", "̃"]
irb(main):006:0> b.chars
=> ["N", "ữ"]

How can I compare these kinds of strings?

Do you get a and b from the same browser and OS? This looks like a browser- or OS-specific character rendering issue to me. You could try to work out the substitution table and then apply the reverse substitution.
– Kirill Fedyanin, Commented Nov 24, 2015 at 15:32

matt · Accepted Answer · 2017-10-23 13:34:54Z

13

This is an issue with Unicode equivalence.

The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests, is a combining character: when rendered, it is combined with the previous character to produce the final glyph.

The b version of the string uses the character ữ (U+1EEF, LATIN SMALL LETTER U WITH HORN AND TILDE) which is a single character, and is equivalent to the previous combination, but uses a different byte sequence to represent it.
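To make the difference between the two strings visible, here is a small sketch that prints the code points and UTF-8 byte lengths of each. It uses only core String methods; the variable names a_points and b_points are just for illustration.

```ruby
# The two strings from the question, written with escapes so the
# difference is visible in the source.
a = "N\u01b0\u0303"   # N, u with horn, combining tilde
b = "N\u1eef"         # N, u with horn and tilde (precomposed)

# Code points formatted as U+XXXX: a has three, b has two.
a_points = a.codepoints.map { |cp| format("U+%04X", cp) }
b_points = b.codepoints.map { |cp| format("U+%04X", cp) }

p a_points      # => ["U+004E", "U+01B0", "U+0303"]
p b_points      # => ["U+004E", "U+1EEF"]

# The UTF-8 byte lengths differ too, which is why == returns false.
p a.bytesize    # => 5
p b.bytesize    # => 4
```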

In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Ruby has had this built in since version 2.2 (in earlier versions you need to use a third-party library).

So currently you have

a == b

which is false, but if you do

a.unicode_normalize == b.unicode_normalize

you should get true.
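If this comparison is needed in several places, it could be wrapped in a small helper. The sketch below assumes Ruby 2.2 or later; same_text? is a hypothetical name, not part of Ruby or Rails, and the stored/incoming variables simply mirror the scenario in the question.

```ruby
# Hypothetical helper: compares two strings after normalizing both
# to the same Unicode normalization form (NFC unless told otherwise).
def same_text?(left, right, form: :nfc)
  left.unicode_normalize(form) == right.unicode_normalize(form)
end

stored   = "N\u01b0\u0303"  # decomposed form, as saved in the database
incoming = "N\u1eef"        # precomposed form, as sent by the browser

p stored == incoming            # => false
p same_text?(stored, incoming)  # => true
```

Normalizing values once, before they are stored, and again on incoming data would probably avoid the mismatch at the source, but that is a design choice that depends on the application.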

If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support, so if you are using Rails you can do:

a.mb_chars.normalize == b.mb_chars.normalize

or perhaps something like:

ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)

If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:

UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)

(nfkc refers to the normalisation form. It is the default form for the Rails methods above, but note that String#unicode_normalize defaults to :nfc, so the results can differ for compatibility characters.)

There are various ways to normalise Unicode strings (i.e. whether you use the decomposed or composed versions, and whether compatibility characters are folded), and these examples just use the defaults. I’ll leave researching the differences to you.
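As a starting point for that research, here is a short sketch of how the forms behave on the strings from the question, assuming Ruby 2.2 or later. The ligature example is an addition for illustration and does not come from the question.

```ruby
a = "N\u01b0\u0303"
b = "N\u1eef"

# Canonical forms: NFC composes where possible, NFD fully decomposes.
nfc = a.unicode_normalize(:nfc)
nfd = b.unicode_normalize(:nfd)
p nfc.codepoints.length   # => 2  (N, U+1EEF)
p nfd.codepoints.length   # => 4  (N, u, combining horn, combining tilde)

# Either form works for comparison as long as both sides use the same one.
p a.unicode_normalize(:nfd) == nfd   # => true

# Compatibility forms (NFKC/NFKD) additionally fold characters such as
# the "fi" ligature U+FB01 into their plain equivalents.
ligature = "\ufb01"
p ligature.unicode_normalize(:nfc)  == "fi"   # => false
p ligature.unicode_normalize(:nfkc) == "fi"   # => true
```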

edited Oct 23, 2017 at 13:34

answered Nov 24, 2015 at 15:41


3 Comments

fguillen Over a year ago

I'm using Ruby 2.0.0p247, which doesn't seem to have this module built in. Is there a third-party library you would recommend? I've found this one, but it has no stars on GitHub and I also have problems installing it.

matt Over a year ago

@fguillen I’ve updated my answer with some suggestions. Your question is tagged with Rails, so using Rails’ support would probably be the best solution here, I think.

fguillen Over a year ago

You are right, I hadn't thought of Rails' internal Unicode module. I have added an example for this scenario to your answer; please correct it if it is not right.

Martin Konecny · Accepted Answer · 2015-11-24 17:22:19Z

You can see these are distinct character sequences: the first uses the modifier "combining tilde", while the second is a single precomposed character.

Wikipedia has a section on this:

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

and

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.

It seems that Ruby supports this normalization, but only as of Ruby 2.2:

http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html

a = "N\u01b0\u0303".unicode_normalize
b = "N\u1eef".unicode_normalize

a == b  # true
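The ñ example from the Wikipedia quote can be checked the same way. This is a minimal sketch, again assuming Ruby 2.2 or later; it also uses String#unicode_normalized?, which reports whether a string is already in the given form (NFC by default).

```ruby
decomposed  = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"    # U+00F1, the single code point for ñ

p decomposed == precomposed                                       # => false
p decomposed.unicode_normalize == precomposed.unicode_normalize   # => true

# Only the precomposed string is already in NFC.
p decomposed.unicode_normalized?    # => false
p precomposed.unicode_normalized?   # => true
```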

Alternatively, if you are using Ruby on Rails, there appears to be a built-in method for normalization.
