
I have the same question as the one asked here, but I couldn't comment there because I don't have enough reputation, so I'm creating a new question.

Anyway the question is:

I want to compare two strings in Ruby the same way MySQL compares two strings under the utf8_general_ci collation.

In more detail: when the utf8_general_ci collation is selected in the database, MySQL treats 'a' and 'ä' as equal when executing queries. Because I want to do batch inserts, I pull all names (a column with the utf8_general_ci collation) into a Ruby script and generate an INSERT statement only if the name does not already exist. In Ruby, however, characters like 'a' and 'ä' compare as different. I want the comparison to behave the way MySQL's does under utf8_general_ci.

In the old question there was an answer using 'iconv', which was deprecated in Ruby 1.9.3 and removed in 2.0. So I think String#encode should be used to do the same, but I couldn't find the exact way to replicate that behavior.

  • Why not simply issue a MySQL command to get the comparison? Commented Jan 14, 2016 at 6:10
  • @RickJames Yes, I can issue a MySQL command to first check whether the row exists and insert it if it does not. But that takes a huge amount of time because I have a lot of data. That's why I'm trying to build a batch of INSERT statements and then upload it to the MySQL DB. Commented Jan 19, 2016 at 17:08
  • INSERT ... ON DUPLICATE KEY UPDATE ... avoids the need to check if a row exists before inserting it. Commented Jan 20, 2016 at 0:56
  • But I want to avoid running an 'insert' query for every entry, since that takes a lot of time. Commented Jan 20, 2016 at 14:32
  • But the SELECT has to spend the time to do the check. By combining the check and the insert/update, it actually saves time. Perhaps I can be clearer (or possibly wrong) if you can show the queries you have now. Commented Jan 20, 2016 at 19:42

1 Answer


AFAIK, there is no straightforward way of doing this in Ruby at the moment. On the other hand, one can simply do it by hand. The ninja way is to use the ICU library for this.

Assuming you want the simplest way and the only goal is to compare strings, one could start by getting rid of accents. Accents can appear in two forms: as combining diacritical marks, or as precomposed characters from the Latin-1 Supplement block. The latter is a legacy of the Latin1/ISO-8859-1 encoding.

Getting rid of combining diacritics is easy:

▶ "lätin1, cömbined".gsub(Regexp.new(("\u0300".."\u036f").to_a.join('|')), '')
#⇒ "lätin1, combined"
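Since the two forms look identical on screen, here is a small sketch that spells them out with explicit escapes. The constant and variable names are illustrative only, and the character class is just a shorter way of writing the same U+0300..U+036F range used above.

```ruby
# Matches any code point in the Combining Diacritical Marks block.
COMBINING_MARKS = /[\u0300-\u036f]/

combined    = "co\u0308mbined" # "o" followed by U+0308 (combining diaeresis)
precomposed = "l\u00e4tin1"    # single code point U+00E4 (Latin-1 Supplement)

stripped_combined    = combined.gsub(COMBINING_MARKS, '')
stripped_precomposed = precomposed.gsub(COMBINING_MARKS, '')

puts stripped_combined                     # the accent is gone
puts stripped_precomposed == precomposed   # true: the precomposed "ä" is untouched
```

Only the combining form is affected, which is why the precomposed characters need separate handling.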

OK, that was the easiest part. Unfortunately, there is no straightforward way to get a mapping of legacy Latin-1 characters to their unaccented equivalents, so one would need to introduce it oneself:

▶ substs = "ÀÁÂÃÄÅ".split(//).product(['A']).to_h
# for the sake of focusing on the problem, the other symbols are dropped

Now the comparison might be done as:

▶  "lÄtin1, cömbined".gsub(Regexp.new(("\u0300".."\u036f").to_a.join('|')), '')
                     .gsub(Regexp.new(substs.keys.join('|')), substs)
#⇒ "lAtin1, combined"

Hence, two strings might be “dediacritized” and then compared.
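As a possible alternative to maintaining the mapping by hand, here is a minimal sketch assuming Ruby 2.2 or later, where `String#unicode_normalize` is available: decomposing to NFD turns most precomposed letters into a base letter plus combining marks, which can then be stripped. The helper names `fold_for_compare` and `same_name?` and the sample data are hypothetical. This only approximates an accent- and case-insensitive comparison and is not an exact reimplementation of MySQL's collation; characters such as 'ß' or 'ø' do not decompose under NFD and would still need explicit handling.

```ruby
# Hypothetical helper: folds a string for accent- and case-insensitive
# comparison. An approximation, not MySQL's actual collation rules.
def fold_for_compare(str)
  str.unicode_normalize(:nfd) # decompose "ä" into "a" + U+0308 (Ruby 2.2+)
     .gsub(/\p{Mn}/, '')      # drop nonspacing marks (combining diacritics)
     .downcase                # the comparison should ignore case as well
end

# Hypothetical helper: true when both strings fold to the same key.
def same_name?(a, b)
  fold_for_compare(a) == fold_for_compare(b)
end

# Illustrative data: names already in the table vs. names to be inserted.
existing = ["Müller", "Ärger"]
incoming = ["muller", "Zoë", "arger"]

seen = existing.map { |name| fold_for_compare(name) }
to_insert = incoming.reject { |name| seen.include?(fold_for_compare(name)) }

puts same_name?("a", "ä") # true
p to_insert               # only the names with no folded match remain
```

For a large table, keeping the folded keys in a `Set` rather than an array would make each lookup cheaper.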

Please note that I admit this approach is wrong. One should use a proper binding to the ICU library, but the above does the trick when you understand what you are doing, and it works out of the box with minimal effort.
