Ruby: compare two strings in terms of database collation utf8_general_ci

Question

I have the same question as the one asked here, but I couldn't comment there because I don't have enough reputation, so I'm creating a new question.

Anyway the question is:

I want to compare two strings in Ruby the same way MySQL compares two strings under the utf8_general_ci collation.

In detail: when the utf8_general_ci collation is selected in the database, MySQL treats 'a' and 'ä' as the same character while executing queries. Since I want to do batch inserts, I am pulling all names (a column with the utf8_general_ci collation) into a Ruby script and generating an insert statement only if the name does not already exist. But in a Ruby comparison, characters like 'a' and 'ä' are treated as different. I want the comparison to behave the way MySQL does under the utf8_general_ci collation.

In the old question there was an answer using 'iconv', which was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0. So I think String#encode should be used to do the same, but I couldn't find the exact way to replicate that behavior.

@RickJames Yes, I can issue a MySQL command to first check whether the row exists and insert it if it does not. But that takes a very long time since I have a lot of data. That's why I'm trying to build a batch of insert statements and then upload it to the MySQL DB.
– santoshthota, Commented Jan 19, 2016 at 17:08
INSERT ... ON DUPLICATE KEY UPDATE ... avoids the need to check if a row exists before inserting it.
– Rick James, Commented Jan 20, 2016 at 0:56
But I want to avoid an 'insert' query for every entry, since it takes a lot of time.
– santoshthota, Commented Jan 20, 2016 at 14:32
But the SELECT has to spend the time to do the check. By combining the check and the insert/update, it actually saves time. Perhaps I can be clearer (or possibly wrong) if you can show the queries you have now.
– Rick James, Commented Jan 20, 2016 at 19:42

Aleksei Matiushkin · Accepted Answer · 2015-12-31 08:24:50Z

AFAIK, there is no straightforward way of doing this in Ruby at the moment. One can, however, do it by hand. The ninja way is to use the ICU library for this.

Assuming you want the simplest way and the only goal is to compare strings, one could start by getting rid of accents. Accented characters come in two forms: a base letter followed by combining diacritical marks (U+0300 to U+036F), or precomposed characters from the Latin-1 Supplement block. The latter is a legacy of the Latin1/ISO-8859-1 encoding.

Getting rid of combining diacritics is easy:

▶ "lätin1, cömbined".gsub(Regexp.new(("\u0300".."\u036f").to_a.join('|')), '')
#⇒ "lätin1, combined"

OK, that was the easiest part. Unfortunately, there is no straightforward way to get a mapping of legacy Latin-1 characters to their unaccented equivalents, so one would need to introduce it oneself:

▶ substs = "ÀÁÂÃÄÅ".split(//).product(['A']).to_h
# for the sake of focusing on the problem, the other symbols are dropped
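
On Ruby 2.2 or later, one possible shortcut is to derive the table from Unicode decomposition instead of typing it out. The following is only a sketch under that assumption; it is not part of the original answer, and it relies on String#unicode_normalize splitting a precomposed letter into a base letter plus combining marks. Characters without a decomposition (ß, Ø, Æ and so on) are left out of the table and would still need manual entries.

```ruby
# Sketch, assuming Ruby >= 2.2 (String#unicode_normalize).
# Walk the Latin-1 Supplement letters U+00C0..U+00FF by code point.
latin1 = (0xC0..0xFF).map { |cp| [cp].pack('U') }

# Keep only characters whose NFD form, minus combining marks (\p{Mn}),
# differs from the original, i.e. the ones that really carry an accent.
substs = latin1.each_with_object({}) do |ch, map|
  base = ch.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  map[ch] = base unless base == ch
end

# The table plugs into gsub the same way as a hand-written one.
stripped = "lÄtin1".gsub(Regexp.union(substs.keys), substs)
```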

Now the comparison might be done as:

▶  "lÄtin1, cömbined".gsub(Regexp.new(("\u0300".."\u036f").to_a.join('|')), '')
                     .gsub(Regexp.new(substs.keys.join('|')), substs)
#⇒ "lAtin1, combined"

Hence, two strings might be “dediacritized” and then compared.
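
As an illustration of the "dediacritize, then compare" idea applied to the batch-insert case from the question, here is a minimal sketch. The helper name `general_ci_key` and the sample names are hypothetical, and the code assumes Ruby 2.4 or later so that `downcase` handles non-ASCII letters. It only approximates utf8_general_ci: letters with no Unicode decomposition, such as ß or Ø, are left unchanged, so MySQL may still treat some pairs as equal that this sketch does not.

```ruby
require 'set'

# Hypothetical helper (not from the answer): reduce a string to a
# comparison key that ignores accents and case.
def general_ci_key(str)
  str.unicode_normalize(:nfd)   # split precomposed letters into base + marks
     .gsub(/\p{Mn}/, '')        # drop the combining marks
     .downcase                  # the _ci collations ignore case
end

existing   = ["Müller", "Ärzte"]             # names already in the table
candidates = ["muller", "ARZTE", "Schmidt"]  # names from the new batch

# Set#add? returns nil when the key is already present, so select keeps
# only names that are new relative to the table and to the batch itself.
seen = existing.map { |name| general_ci_key(name) }.to_set
to_insert = candidates.select { |name| seen.add?(general_ci_key(name)) }
```

With the sample data above, only "Schmidt" ends up in `to_insert`, because the other two candidates collapse to the same keys as the existing names.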

Please note that I admit this approach is wrong. One should use a proper binding to the ICU library, but the above does the trick when you understand what you are doing, and it works out of the box with minimal effort.
