14

I'm running MySQL 5.1.50 and have a table that looks like this:

organizations | CREATE TABLE `organizations` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `phone` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25837 DEFAULT CHARSET=utf8 |

The problem I'm having is that MySQL is matching unicode characters with ascii versions. For example when I search for a word with that contains an 'é', it will match the same word that has an 'e' instead, and vice versa:

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id, name FROM `organizations` WHERE `name` = 'Universite de Montreal';
    +-------+-------------------------+
| id    | name                    |
+-------+-------------------------+
| 16973 | Université de Montreal  |
+-------+-------------------------+
1 row in set (0.01 sec)

I get these results both from PHP and the command line console. How can I get accurate matches from my SELECT queries?

Thanks!

5 Answers 5

13

You specified the name column as text CHARACTER SET utf8 COLLATE utf8_unicode_ci which tells MySQL to consider e and é as equivalent in matching and sorting. That collation and utf8_general_ci both make a lot of things equivalent.

http://www.collation-charts.org/ is a great resource once you learn how to read the charts, which is pretty easy.

If you want e and é etc. to be considered different then you must choose a different collation. To find out what collations are on your server (assuming you're limited to UTF-8 encoding):

mysql> show collation like 'utf8%';

And choose using the collation charts as a reference.

One more special collation is utf8_bin in which there are no equivalencies, it's a binary match.

The only MySQL Unicode collations I'm aware of that are not language specific are utf8_unicode_ci, utf8_general_ci and utf8_bin. They are rather weird. The real purpose of a collation is to make the computer match and sort as a person from somewhere would expect. Hungarian and Turkish dictionaries have their entries ordered according to different rules. Specifying a collation allows you to sort and match according to such local rules.

For example, it seems Danes consider e and é equivalent but Icelanders don't:

mysql> select _utf8'e' collate utf8_danish_ci
    -> = _utf8'é' collate utf8_danish_ci as equal;
+-------+
| equal |
+-------+
|     1 |
+-------+

mysql> select _utf8'e' collate utf8_icelandic_ci
    -> = _utf8'é' collate utf8_icelandic_ci as equal;
+-------+
| equal |
+-------+
|     0 |
+-------+

Another handy trick is to fill a one column table with a bunch of characters you're interested in (it's easier from a script) and then MySQL can tell you the equivalencies:

mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a               |
| á               |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á             |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á           |
+-----------------+
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks very much for the thorough reply!
8

Of course, this will work:

SELECT * FROM table WHERE name LIKE BINARY 'namé';

1 Comment

I tried all kinds of collation variants and '%º%' (ordinal indicator, not degree symbol) kept matching things it shouldn't. Saw this and tried it and it worked like a charm. Thanks!
2

one thing you can do with your query string is to decode it...

< ?php
$query="उनकी"; // some Unicode characters
$query=urldecode($query);
$qry= "SELECT * FROM table WHERE books LIKE '%$query%'";

//rest of the code....
?>

it worked for me. :)

Comments

1

You have set collation to utf8_unicode_ci which equates accented latin characters. Additional information can be found here.

3 Comments

user825466 did set COLLATE utf8_unicode_ci and this is why MySQL returns matches such as the one in the example that he or she did NOT want.
@fsb - Yes, I was explaining to the question writer why they were seeing the collation. My reading of the question was that they were unaware of the fact.
You're both right - I didn't know why, and I also did not want it to happen. I ended up coding around it with PHP, but if the need arises in the future I'll specify the collation in the SELECT statement. Thanks.
0

I found out, that you get the requested result using REGEXP

SELECT * FROM table WHERE name REGEXP 'namé';

But this doesn't help if you try to group exactly by name.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.