Another utf-8 related problem I believe...

I am using php to update data in a mysql db and then display that data elsewhere on the site. I have run into utf-8 problems before, where special characters are displayed as question marks when viewed in a browser, but this one seems slightly different.

I have a number of records to enter that contain the è character. If I enter this directly in the db then it appears correctly on the page so I take this to mean that utf-8 content is being output correctly.

However, when I try to update the values in the db through php, the è character is replaced. What gets stored instead is &Atilde;&uml;, which appears in the browser as Ã¨

I have the tables in the database set to use UTF-8. I believe this is correct because, as mentioned, if I update the db through phpMyAdmin, it's all ok. Similarly, I have set the character encoding for the page, which seems to be correct. I am also running the sql statement "SET NAMES 'utf8';" before trying to update the db.

Anyone have any other ideas as to where the problem may lie?

Many thanks

4 Answers

3

Yup.

The character you have is LATIN SMALL LETTER E WITH GRAVE. As you can see, in UTF-8 that character is encoded into two bytes 0xC3 and 0xA8.

But in many default, western encodings (such as ISO-8859-1) which are single-byte only, this multi-byte character is decoded as two separate characters, LATIN CAPITAL LETTER A WITH TILDE and DIAERESIS. Notice how they are both encoded as C3 and A8 in ISO-8859-1?

Furthermore, it looks like PHP is processing these characters through htmlentities(), which results in &Atilde; and &uml; respectively.

So, where exactly is the problem in your code? Well, htmlentities() could be doing it all by itself, since its 3rd argument is an encoding name - which you may not have properly set to 'UTF-8'. But it could be some other string processing function as well. (Note: As a general rule, it's a bad idea to store HTML entities in the database - this step should be reserved for time of display.)
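To make that concrete, here is a minimal sketch of the effect. It assumes the asker's code calls htmlentities() with a single-byte encoding in effect, which is a guess, since the original code isn't shown. The è is written as explicit bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// "è" as its two UTF-8 bytes, independent of this file's own encoding.
$e = "\xC3\xA8";

// UTF-8 stores the character in two bytes.
echo bin2hex($e), "\n"; // c3a8

// Told the input is ISO-8859-1, htmlentities() treats each byte as a separate character.
$wrong = htmlentities($e, ENT_QUOTES, 'ISO-8859-1');
echo $wrong, "\n"; // &Atilde;&uml;

// Told the input is UTF-8, it sees one character.
$right = htmlentities($e, ENT_QUOTES, 'UTF-8');
echo $right, "\n"; // &egrave;
```

The first result is exactly the string that ended up in the database, which is why the encoding argument is the first thing worth checking.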

There are a bunch of other ways to trip yourself up with UTF-8 in php - I suggest hitting up the cheatsheet and making sure you're in good shape.

3 Comments

Yup. A bit lengthy way to say "get rid of htmlentities".
I always like to explain exactly what's going on when encodings are involved. Anything I can do to elevate understanding is a win in my book.
Cheers for that. Much appreciated
1

Well, it is your own code that converts characters into entities.
To make it right:

  1. Ban the htmlentities function from your scripts forever.
  2. Use htmlspecialchars, not on insert but when displaying data.
  3. Repair existing data in the database using html_entity_decode.
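
A rough sketch of steps 2 and 3, assuming the damaged rows hold UTF-8 bytes that were run through htmlentities() as ISO-8859-1. The sample value is made up for illustration, and fetching and updating the rows is left out.

```php
<?php
// A made-up value as it was wrongly stored ("crème" with the è turned into two entities).
$stored = 'cr&Atilde;&uml;me';

// Step 3: decoding to ISO-8859-1 turns each entity back into one byte,
// and those bytes are the original UTF-8 sequence.
$repaired = html_entity_decode($stored, ENT_QUOTES, 'ISO-8859-1');
echo bin2hex($repaired), "\n"; // 6372c3a86d65

// Step 2: escape only when displaying; htmlspecialchars() touches only
// the HTML-significant characters and leaves "è" alone.
$html = htmlspecialchars($repaired . ' & co', ENT_QUOTES, 'UTF-8');
echo $html, "\n";
```

Try the repair on a copy of the table first, since rows that were entered correctly through phpMyAdmin should not be decoded again.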

0

I suppose you're taking the results of some form submission and inserting the results in the database. If so, you must ensure that you instruct the browser to send UTF-8 data and you should validate the user input for a valid UTF-8 stream.

Change your form element to include accept-charset:

<form accept-charset="utf-8" method="post" ... >
    <input type="text" name="field" />
    ...
</form>

Validate the data with:

$valid = array_key_exists("field", $_POST) && !is_array($_POST['field']) &&
    preg_match('//u', $_POST['field']) && ...; //check length with mb_strlen etc.
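
The same check can be wrapped in a small function. This is only a sketch: is_valid_utf8_field is a hypothetical helper name, the length limit is an arbitrary example, and it assumes PHP 7 or later with the mbstring extension loaded. It uses mb_check_encoding in place of the preg_match('//u') trick above.

```php
<?php
// Hypothetical helper: true when $input[$key] is a string of valid UTF-8
// no longer than $maxLen characters.
function is_valid_utf8_field(array $input, string $key, int $maxLen): bool
{
    // Rejects missing keys and array values such as field[]=x.
    if (!array_key_exists($key, $input) || !is_string($input[$key])) {
        return false;
    }
    $value = $input[$key];
    // Length is counted in characters, not bytes.
    return mb_check_encoding($value, 'UTF-8')
        && mb_strlen($value, 'UTF-8') <= $maxLen;
}
```

It would be called as is_valid_utf8_field($_POST, 'field', 255) before the value goes anywhere near the database.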

0

I think you are missing the Content-Type declaration on the html page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

If you don't have it, the browser will guess the encoding, and convert any characters outside of that encoding to entities when posting a form.
