MySQL matching unicode characters with ascii version

Question

I'm running MySQL 5.1.50 and have a table that looks like this:

CREATE TABLE `organizations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `phone` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25837 DEFAULT CHARSET=utf8

The problem I'm having is that MySQL is matching Unicode characters with their ASCII versions. For example, when I search for a word that contains an 'e', it also matches the same word spelled with an 'é', and vice versa:

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id, name FROM `organizations` WHERE `name` = 'Universite de Montreal';
+-------+-------------------------+
| id    | name                    |
+-------+-------------------------+
| 16973 | Université de Montreal  |
+-------+-------------------------+
1 row in set (0.01 sec)

I get these results both from PHP and the command line console. How can I get accurate matches from my SELECT queries?

Thanks!

user213154 · Accepted Answer · 2011-07-06 20:15:12Z

You specified the name column as text CHARACTER SET utf8 COLLATE utf8_unicode_ci, which tells MySQL to treat e and é as equivalent in matching and sorting. That collation and utf8_general_ci both make a lot of characters equivalent.

http://www.collation-charts.org/ is a great resource once you learn how to read the charts, which is pretty easy.

If you want e and é etc. to be considered different then you must choose a different collation. To find out what collations are on your server (assuming you're limited to UTF-8 encoding):

mysql> show collation like 'utf8%';

And choose using the collation charts as a reference.

One more special collation is utf8_bin, in which there are no equivalences: it's a binary match.
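
Applied to the table in the question, a minimal sketch might look like the following. It assumes the connection character set is utf8 (as after SET NAMES utf8). Keep in mind that utf8_bin is also case-sensitive, so 'montreal' would no longer match 'Montreal'. The ALTER TABLE variant is only one possible choice, not something the question asked for, and it changes how the column sorts as well.

```sql
-- One-off: force an exact comparison for this query only.
-- With utf8_bin, 'Universite' should no longer match 'Université'.
SELECT id, name
FROM `organizations`
WHERE `name` = 'Universite de Montreal' COLLATE utf8_bin;

-- Permanent (optional): change the column's collation so that
-- every comparison and sort on `name` is binary.
ALTER TABLE `organizations`
  MODIFY `name` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
```

The per-query form leaves the stored data and the default sort order untouched, which is usually the safer place to start.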

The only MySQL Unicode collations I'm aware of that are not language-specific are utf8_unicode_ci, utf8_general_ci and utf8_bin. They are rather weird. The real purpose of a collation is to make the computer match and sort as a person from a particular place would expect. Hungarian and Turkish dictionaries have their entries ordered according to different rules. Specifying a collation allows you to sort and match according to such local rules.

For example, it seems Danes consider e and é equivalent but Icelanders don't:

mysql> select _utf8'e' collate utf8_danish_ci
    -> = _utf8'é' collate utf8_danish_ci as equal;
+-------+
| equal |
+-------+
|     1 |
+-------+

mysql> select _utf8'e' collate utf8_icelandic_ci
    -> = _utf8'é' collate utf8_icelandic_ci as equal;
+-------+
| equal |
+-------+
|     0 |
+-------+

Another handy trick is to fill a one-column table with a bunch of characters you're interested in (it's easier from a script); MySQL can then tell you the equivalences:

mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a               |
| á               |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á             |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á           |
+-----------------+
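
For the e / é pair from the question, the same kind of check can be done without a table. This is only a sketch, assuming the client sends UTF-8 (for example after SET NAMES utf8); the column aliases are made up for illustration.

```sql
-- COLLATE binds more tightly than =, so each comparison
-- is carried out under the collation named on its right-hand side.
SELECT
  _utf8'e' = _utf8'é' COLLATE utf8_unicode_ci AS unicode_ci,
  _utf8'e' = _utf8'é' COLLATE utf8_general_ci AS general_ci,
  _utf8'e' = _utf8'é' COLLATE utf8_bin        AS bin;
```

Given the behavior described above, the first two columns should come back as 1 and the last as 0.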

user1068477 · Accepted Answer · 2013-10-09 01:13:57Z

8

Of course, this will work:

SELECT * FROM `table` WHERE name LIKE BINARY 'namé';
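
A possible variation, sketched here as an assumption rather than as part of the original answer: LIKE still treats % and _ in the search term as wildcards, so for a strict whole-string match the BINARY operator can be combined with = instead. Either way the comparison is byte-by-byte, which makes it case-sensitive as well as accent-sensitive, and the literal must arrive in the same encoding as the column (utf8 here).

```sql
-- Exact, byte-for-byte match on the whole value; no wildcard handling.
-- `table` and `name` are the placeholder names used in the answer above.
SELECT * FROM `table` WHERE name = BINARY 'namé';
```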

1 Comment

user502255 Over a year ago

I tried all kinds of collation variants and '%º%' (ordinal indicator, not degree symbol) kept matching things it shouldn't. Saw this and tried it and it worked like a charm. Thanks!

dland · Accepted Answer · 2014-10-15 12:06:38Z

2

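One thing you can do with your query string is decode it:

<?php
$query = "उनकी"; // some Unicode characters
// urldecode() only changes the string if it is percent-encoded (e.g. taken from a URL)
$query = urldecode($query);
// Note: interpolating $query directly is open to SQL injection; escape it or use a prepared statement
$qry = "SELECT * FROM `table` WHERE books LIKE '%$query%'";

// rest of the code...
?>

It worked for me. :)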

edited Oct 15, 2014 at 12:06

dland

answered Jan 2, 2012 at 14:05

Amit Kumar Khare

borrible · Accepted Answer · 2011-07-01 21:39:00Z

1

You have set the collation to utf8_unicode_ci, which equates accented Latin characters with their unaccented forms. Additional information can be found here.

3 Comments

user213154 Over a year ago

user825466 did set COLLATE utf8_unicode_ci and this is why MySQL returns matches such as the one in the example that he or she did NOT want.

borrible Over a year ago

@fsb - Yes, I was explaining to the question writer why they were seeing the collation. My reading of the question was that they were unaware of the fact.

user825466 Over a year ago

You're both right - I didn't know why, and I also did not want it to happen. I ended up coding around it with PHP, but if the need arises in the future I'll specify the collation in the SELECT statement. Thanks.

Zeal · Accepted Answer · 2013-02-07 14:17:36Z

0

I found out that you get the requested result using REGEXP:

SELECT * FROM `table` WHERE name REGEXP 'namé';

But this doesn't help if you try to group exactly by name.
