
Assume that I need to insert the following document:

{
    title: 'Péter'
}

(note the é)

It gives me an error when I use the following PHP-code ... :

$db->collection->insert(array("title" => "Péter"));

... because it needs to be utf-8.

So I should use this line of code:

$db->collection->insert(array("title" => utf8_encode("Péter")));

Now, when I request the document, I still have to decode it ... :

$document = $db->collection->findOne(array("_id" => new MongoId("__someID__")));
$title = utf8_decode($document['title']);

Is there some way to automate this process? Can I change the character encoding of MongoDB? (I'm migrating a MySQL database that uses cp1252 West European (latin1).)

I already considered changing the Content-Type header, but the problem is that all the static (hardcoded) strings aren't UTF-8...

Thanks in advance! Tim

  • Did you find an answer below that is acceptable? If so, will you please accept it? Commented Jun 14, 2013 at 18:21

3 Answers


JSON and BSON can only encode / decode valid UTF-8 strings. If your data (including input) is not UTF-8, you need to convert it before passing it to any JSON-dependent system, like this:

$string = iconv('UTF-8', 'UTF-8//IGNORE', $string); // or
$string = iconv('UTF-8', 'UTF-8//TRANSLIT', $string); // or even
$string = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $string); // not sure how this behaves

Personally I prefer the first option; see the iconv() manual page. Other alternatives include mb_convert_encoding() and utf8_encode() / utf8_decode().

You should always make sure your strings are UTF-8 encoded, even the user-submitted ones. However, since you mentioned that you're migrating from MySQL to MongoDB, have you tried exporting your current database to CSV and using the import scripts that come with Mongo? They should handle this...
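For data that is known to be cp1252 rather than merely "possibly broken UTF-8", the conversion needs the real source encoding. The following is a minimal sketch, assuming the strings coming out of MySQL are cp1252; the helper name `cp1252_to_utf8()` is hypothetical and not part of any library.

```php
<?php
// Hypothetical helper: convert a cp1252 (Windows-1252) string to UTF-8.
// The source encoding is an assumption taken from the question; adjust it
// if your data uses something else.
function cp1252_to_utf8(string $value): string
{
    $converted = iconv('CP1252', 'UTF-8', $value);
    if ($converted === false) {
        // iconv() returns false when the input contains bytes that are
        // not defined in the source encoding.
        throw new RuntimeException('Could not convert value to UTF-8');
    }
    return $converted;
}

// "\xE9" is "é" in cp1252; in UTF-8 it becomes the two bytes C3 A9.
$title = cp1252_to_utf8("P\xE9ter");
```

The string literal uses byte escapes so the example does not depend on the encoding of the .php file itself.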


EDIT: I mentioned that BSON can only handle UTF-8, but I'm not sure this is exactly true. I have a vague idea that BSON uses UTF-16 or UTF-32 to encode / decode data, but I can't check now.


7 Comments

AlixAxel: Using mb_convert_encoding() seems like a good idea for the cp1252 data to get it to UTF-8. But have you tested the other code you've included in your answer? If the PHP source code file Tim (the OP) is using to test is encoded as UTF-8, Tim shouldn't need to do any encoding.
Also, how might a CSV export/import help?
@AdamMonsen: The problem is not in the .php file but in the database data, hence the import/export and the functions I mentioned. I've tested (and I prefer) the iconv extension over any other option.
@AdamMonsen: Also, I'm not sure if you misunderstood the question or not, but your statement on your first comment is incorrect. A UTF-8 encoded file will not automatically make all the data fed to it from outside sources valid UTF-8.
AlixAxel: re iconv() vs. mb_convert_encoding(), ok, I'll take your word for that, I don't have much experience with either. As for my first comment, I realize his data is in latin1 and changing the encoding of a file with PHP code has no effect on data processed by the script. I was referring to the encoding of what is probably a test .php script with a code snippet including a literal string: "Péter".

As @gates said, all string data in BSON is encoded as UTF-8. MongoDB assumes this.

Another key point which neither answer addresses: PHP is not Unicode-aware, at least as of 5.3 (PHP 6 will supposedly be). This means you have to know what encoding your operating system uses by default and what encoding PHP is using.

Let's get back to your original question: "Is there some way to automate this process?" ... my suggestion is to make sure you are always using UTF-8 throughout your application. Configuration, input, data storage, presentation, everything. Then the "automated" part is that most of your PHP code will be simpler since it always assumes UTF-8. No conversions necessary. Heck, nobody said automation was cheap. :)
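During a migration, one way to approximate "automatic" is to normalise every document in one place just before it is inserted. This is only a sketch under stated assumptions: the function name `ensure_utf8()` is made up for illustration, the legacy encoding is assumed to be Windows-1252, and the mbstring extension is assumed to be available. Array keys are left untouched.

```php
<?php
// Hypothetical helper: walk a document and convert every string value that
// is not already valid UTF-8 from an assumed legacy encoding.
function ensure_utf8(array $document, string $sourceEncoding = 'Windows-1252'): array
{
    array_walk_recursive($document, function (&$value) use ($sourceEncoding) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $value = mb_convert_encoding($value, 'UTF-8', $sourceEncoding);
        }
    });
    return $document;
}

$document = ensure_utf8(array(
    'title' => "P\xE9ter",               // cp1252 bytes, gets converted
    'tags'  => array("caf\xE9", 'plain') // nested values are handled too
));

// The normalised array can then be passed to the driver, for example:
// $db->collection->insert($document);
```

The validity check is a heuristic: a legacy string whose bytes happen to form valid UTF-8 would be left as-is, so this is no substitute for knowing the encoding of each data source.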

Here's kind of an aside. If you created a little PHP script to test that insert() code, figure out what encoding your file is in, then convert to UTF-8 before inserting. For example, if you know the file is ISO-8859-1, try this:

$title = mb_convert_encoding("Péter", "UTF-8", "ISO-8859-1");
$db->collection->insert(array("title" => $title));


2 Comments

My rationale: if you export to CSV you can make your data valid UTF-8 in one simple step, and with the mongo scripts importing it is just as simple. Regarding the encoding options, iconv comes from the C world, so it's much more reliable than the mbstring extension. If you don't have access to either, utf8_decode() and utf8_encode() are the next best thing, since all invalid codepoints will be dropped but the remaining data is still valid UTF-8.
@AlixAxel: (a) Maybe a more complete example would help. (b) iconv() is more reliable? really?

Can I change the character-encoding of MongoDB...

No. Data is stored in BSON, and according to the BSON spec, all strings are UTF-8.
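To see why the un-encoded insert fails, it can help to look at the bytes. The sketch below compares the same name in latin1 and in UTF-8; `is_valid_utf8()` is a hypothetical helper that relies on PCRE's `u` modifier rejecting invalid UTF-8 subjects.

```php
<?php
// The same name in two encodings, written as byte escapes so the result
// does not depend on how this file is saved.
$latin1 = "P\xE9ter";     // "é" is the single byte E9
$utf8   = "P\xC3\xA9ter"; // "é" is the two bytes C3 A9

// Hypothetical helper: with the "u" modifier, preg_match() fails on a
// subject that is not valid UTF-8.
function is_valid_utf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

$latin1IsValid = is_valid_utf8($latin1); // false: E9 followed by "t" is not a valid sequence
$utf8IsValid   = is_valid_utf8($utf8);   // true

var_dump(strlen($latin1), strlen($utf8)); // int(5) and int(6)
```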

Now, when I request the document, I still have to decode it ... : Is there some way to automate this process?

It sounds like you are trying to output the data to a web page. Needing to "decode" text that was already encoded seems incorrect.

Could this output problem be a configuration issue with Apache + PHP? UTF-8 support in PHP is not automatic; a quick online search brought up several tutorials on this topic.
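As a rough illustration of the PHP side of that configuration, the sketch below declares UTF-8 for the response and escapes a value without decoding it. It assumes the page is HTML and that every string in the application is already UTF-8; the `$title` literal merely stands in for a value read back from MongoDB.

```php
<?php
// Assumption: all strings handled by the application are UTF-8.
ini_set('default_charset', 'UTF-8');

// Tell the browser which encoding the response uses, unless output has
// already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Stands in for $document['title'] as returned by the driver.
$title = "P\xC3\xA9ter";

// Escape for HTML, naming the encoding explicitly; no utf8_decode() needed.
$html = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
echo '<h1>' . $html . '</h1>';
```

Web server settings (for example, a default charset configured in Apache) can still override or conflict with this, so they are worth checking as well.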
