Driving me nuts...

Page with form is encoded as Unicode (UTF-8) via:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

entry column in database is text utf8_unicode_ci

Copying text from a Word document with " in it, like this: “1922.”, is insta-fail and ends up in the database as â��1922.â��. (Typing new data into the form, including ", works fine. It's only cutting and pasting from Word that breaks.)

PHP steps behind the scenes are:

  • grab value from POST
  • run through HTML Purifier default settings
  • run through mysql_real_escape_string
  • insert query into dbase

Help?

2 Answers

“1922.” and "1922." are two different strings.
The quotes Word inserts are typographic ("smart") quotes, not ASCII double quotes: “ is not the same character as ".

The column that you describe is text utf8_unicode_ci. utf8_unicode_ci is the collation, make sure the charset on that column is set to utf8.

Then I would make sure that you set up the correct encoding for each connection using SET NAMES utf8 COLLATE utf8_unicode_ci...

If you've done that and it's still not saved properly, make sure your PHP has the mbstring extension enabled and try working with the mb_ functions.

There are many possible root causes, but I think setting the charset on the column and running SET NAMES ... should solve it.

1 Comment

calling mysql_query('SET NAMES utf8'); in the init.php include was the final piece of the puzzle, thank you

Call mysql_set_charset to let the database know you are going to be sending it UTF-8 encoded strings.
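A minimal sketch of that idea, assuming the mysqli extension rather than the old mysql_* functions (which were removed in PHP 7). The helper name connect_utf8 and its parameters are made up for illustration; mysqli_connect, mysqli_set_charset and mysqli_error are the real calls. It uses 'utf8' to match the column described in the question; MySQL's utf8 only covers characters of up to three bytes, so utf8mb4 may be the better choice if both the column and the server support it.

```php
<?php
// Hypothetical helper: open a mysqli connection and declare the client
// character set, so MySQL knows the strings it receives are UTF-8.
function connect_utf8(string $host, string $user, string $pass, string $db): mysqli
{
    $link = mysqli_connect($host, $user, $pass, $db);
    if ($link === false) {
        throw new RuntimeException('Connection failed: ' . mysqli_connect_error());
    }

    // Also informs mysqli_real_escape_string() about the charset in use,
    // which a plain "SET NAMES" query does not do.
    if (!mysqli_set_charset($link, 'utf8')) {
        throw new RuntimeException('Could not set charset: ' . mysqli_error($link));
    }

    return $link;
}
```

Something like this would be called once in the shared include (the init.php mentioned in the comment above), before any query runs.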

typing new data into the form, including " works fine...

Well, " is a normal ASCII quote. “ and ” aren't; they're smart quotes, which are non-ASCII characters. Whether they come from Word is unimportant; all your non-ASCII characters will be treated the same.
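To see why only the non-ASCII characters get mangled, here is a rough sketch of what probably happens when UTF-8 bytes are read as a single-byte charset and then re-encoded. It assumes the connection defaulted to something like ISO-8859-1 (the question doesn't say what it was), and misread_as_latin1 is a hypothetical helper written for this illustration, not a PHP built-in. The "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper: re-encode a byte string to UTF-8 as if every byte
// were a separate ISO-8859-1 character, mimicking a misconfigured connection.
function misread_as_latin1(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $char) {
        $byte = ord($char);
        if ($byte < 0x80) {
            $out .= $char; // ASCII bytes mean the same thing in both encodings
        } else {
            // Code points U+0080..U+00FF need two bytes in UTF-8
            $out .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $out;
}

$smart = "\u{201C}"; // left smart quote
echo bin2hex($smart), "\n";                    // e2809c
echo bin2hex(misread_as_latin1($smart)), "\n"; // c3a2c280c29c
echo misread_as_latin1('"1922."'), "\n";       // "1922."
```

The one smart quote turns into three characters: â (c3a2) followed by two unprintable ones, which matches the shape of the garbage in the question. The plain ASCII string comes through untouched.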

  • grab value from POST
  • run through HTML Purifier default settings

That's a bad idea. HTML Purifier should be run over strings that are HTML and that you intend to output as HTML, for the relatively rare case where you need to let users submit HTML.

It is totally the wrong thing to run over all input text. Normally you should be allowing any old text, and then when you output that text inside HTML you should be calling htmlspecialchars() over it.

Otherwise you're breaking the ability of users to enter < and & like I am in this post, and you still risk cross-site-scripting when you are outputting processed or non-input-sourced data.
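As a sketch of escaping at output time, assuming the page is served as UTF-8: the wrapper name h is just an illustrative short alias, and $name stands in for any string pulled from the database.

```php
<?php
// Hypothetical short alias for HTML-escaping text at output time.
function h(string $text): string
{
    // ENT_QUOTES also escapes single quotes, so the result is safe inside
    // attribute values; the charset is passed explicitly as UTF-8.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Stored as-is in the database, markup characters and smart quotes included.
$name = "<b>Tom & Jerry</b> \u{201C}1922.\u{201D}";

echo '<p>Hello ' . h($name) . '!</p>', "\n";
```

The < and & come out as &lt; and &amp;, while the smart quotes are left alone, so nothing the user typed is lost and nothing is interpreted as markup.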

7 Comments

Hi Bob! I was using HTML Purifier to strip out all HTML from that form field, as it does get displayed on the site. Is that still a bad practice?
If text content from the database is getting output as raw HTML, that's a really bad thing. You need to fix it on the output end by calling htmlspecialchars() every time you drop a string into HTML, for example: <p>Hello <?php echo htmlspecialchars($name); ?>!</p> or echo "<p>Hello ".htmlspecialchars($name)."!</p>"; never echo "<p>Hello $name!</p>";. (You can make a function with a shorter name to avoid so much typing.) You can't fix this properly on the input end at all, HTMLPurifier or no.
I'm currently running output through htmlentities before display. I don't also need to use htmlspecialchars, do I?
You should use htmlspecialchars() as a better version of htmlentities(). htmlentities needlessly tries to encode all non-ASCII characters, and defaults to treating them as ISO-8859-1, so if you're using any other charset like UTF-8, it will totally screw them over unless you remember to pass the charset argument in each call. htmlspecialchars() only encodes the few characters like < that really need it. It's almost always the better function; it's a shame that so many crappy PHP tutorials, if they bother to mention HTML-escaping at all, jump for nasty htmlentities().
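A quick side-by-side, as a sketch with a made-up sample string. The charset is passed explicitly in both calls, because the default depends on the PHP version (ISO-8859-1 in older releases, UTF-8 from PHP 5.4 on).

```php
<?php
// Sample text: smart quotes, an accented letter, and one markup character.
$text = "\u{201C}caf\u{E9}\u{201D} <b>";

// Only the characters that are special in HTML get escaped.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Every character with a named entity gets replaced.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";
echo $entities, "\n";
```

The first line keeps the smart quotes and the é as they are and only turns < and > into &lt; and &gt;. The second rewrites the quotes and the é as &ldquo;, &eacute; and &rdquo; too, which is harmless with the right charset but pointless on a UTF-8 page.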
Thanks for sticking with me Bob. I'll make the adjustments and test it tonight.