2

What is the best way to convert user input to UTF-8?

I have a simple form where a user will pass in HTML. The HTML can be in any language and in any character encoding.

My question is:

  • Is it possible to represent everything as UTF-8?

  • What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?

I am trying to work out how to best implement this - advice and links appreciated.

I am making use of Codeigniter and its input class to retrieve post data.

A few points I should make:

  • I need to convert HTML special characters to their respective entities
  • It might be a good idea to accept any encoding and return the output in that same encoding. However, my web app uses:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This might have an adverse effect on things.

6 Answers

4

Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:

<form action="foo" accept-charset="UTF-8">...</form>

See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.


8 Comments

What would happen if a user pastes in HTML from their editor which is in the windows-1252 or some sort of iso encoding? Would the browser have no trouble converting this? Thank you for the link, looks super useful/thorough.
The browser should automatically send the info with the correct character encoding...
This might not work in IE according to: w3schools.com/tags/att_form_accept_charset.asp - have you experienced any problems with IE?
@Abs: That attribute is informative only. It does not technically prevent any kind of data from being sent to your PHP script.
@hakre Technically true, but then you're just sh*t-outta-luck. :) You can't really do more than specify what you expect, clients will need to comply or all bets are off.
2

Is it possible to represent everything as UTF-8?

Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.

What can I use to effectively convert any character encoding to UTF-8

iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
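As a minimal sketch of that second kind of request, the snippet below converts a string whose encoding is known in advance. The sample bytes and the choice of ISO-8859-1 as the source encoding are assumptions made for illustration; substitute whatever encoding your data is actually in.

```php
<?php
// Assumption: these bytes are known to be ISO-8859-1 ("café", with é as the single byte 0xE9).
$latin1 = "caf\xE9";

// iconv(from, to, string): the source encoding has to be stated explicitly.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// In UTF-8, é becomes the two-byte sequence 0xC3 0xA9.
echo bin2hex($utf8), "\n"; // 636166c3a9
```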

If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.

so that I can parse it with PHP string functions

"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.

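To illustrate the difference between byte-based and encoding-aware functions, here is a small sketch. It assumes the mbstring extension is loaded; the sample string is made up for the example.

```php
<?php
// "é" is one character but two bytes in UTF-8 (0xC3 0xA9).
$text = "caf\xC3\xA9";

$bytes = strlen($text);             // counts bytes
$chars = mb_strlen($text, 'UTF-8'); // counts characters

// Byte-based substr() can cut a multi-byte character in half.
$broken = substr($text, 0, 4);             // "caf" plus a dangling 0xC3
$intact = mb_substr($text, 0, 4, 'UTF-8'); // all four characters, é included

var_dump($bytes, $chars);
var_dump(mb_check_encoding($broken, 'UTF-8')); // the truncated string is no longer valid UTF-8
```

Whether the byte-based result matters depends on the operation: searching for an ASCII delimiter is generally safe on UTF-8 bytes, while counting, truncating, or changing case is not.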
save it to my database

Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.

subsequently echo out using htmlentities?

Just echo htmlentities($text), nothing more you need to do.
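One caveat worth sketching: older PHP versions defaulted to ISO-8859-1 for htmlentities, so passing the encoding explicitly is the safer habit. The sample text below is an assumption for illustration and is taken to be valid UTF-8 already.

```php
<?php
// Assumption: $text is already valid UTF-8.
$text = "<b>caf\xC3\xA9</b> & \"quotes\"";

// Passing the encoding explicitly avoids relying on the PHP version or ini defaults.
$safe = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $safe, "\n"; // &lt;b&gt;caf&eacute;&lt;/b&gt; &amp; &quot;quotes&quot;
```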

However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This might have an adverse effect on things.

Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.

May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.

5 Comments

@hakre I'd like to hear your objection to UTF-8 in the database. What would you prefer?
+1: Well done answer, some PHP functions (next to mb) have different encoding support however. And avoid having UTF-8 in the MySQL database when you don't need to. But well, defer the details :).
MySQL: There are two things: storage requirements and character support. MySQL uses up to three bytes per character for UTF-8, which can make some tables consume more bytes than needed, e.g. temporary tables, which can cause trouble or a performance drain. Besides that, not all planes of Unicode are supported: MySQL supports the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0.
@hakre Interesting, I have never looked into that. MySQL 5.5+ supports Unicode 5.0 though. UTF-8 still is wasteful apparently.
Three bytes still in 5.5: "The utf8 character set is the same in MySQL 5.5 as before 5.5 and has exactly the same characteristics: [...]" Ref. - Look out for varchar / char columns as well: char needs to reserve the maximum number of bytes per character, even if they are not needed.
1

1) Is it possible to represent everything as UTF-8?

Yes, everything defined in Unicode. That's the most you can get nowadays, and Unicode still has room for future additions.

2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?

The only thing you need to know is the actual encoding of your data. If you want your web application to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your application's user interface.

Within PHP you need to feed any function with the encoding it supports. Some need to have the encoding specified, for some you need to convert it. Always check the function docs if it supports what you ask for. Additionally check your PHP configuration.

Related:

  1. Preparing PHP application to use with UTF-8
  2. How to detect malformed utf-8 string in PHP?
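As a rough sketch of checking input before handing it to encoding-sensitive functions, the helper below tests whether a string is well-formed UTF-8. The function name is hypothetical, and the sketch assumes the mbstring extension and PHP 7 or later (for the type declarations).

```php
<?php
// Hypothetical helper: true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // well-formed: é encoded as 0xC3 0xA9
var_dump(is_valid_utf8("caf\xE9"));     // malformed: 0xE9 starts a multi-byte sequence that never completes
```

A check like this only says whether the bytes are valid UTF-8. It cannot tell you which other encoding invalid input was meant to be in.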

5 Comments

I would like a [citation-needed] for the claim that UTF-8 cannot encode all Unicode code points!
@deceze: Is that enough for a starter? "RFC 3629 UTF-8 November 2003 3. UTF-8 definition UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646] In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets." - The UTF-16 range is just not the full range. So neither is UTF-8.
Then I'll ask you where it says that UTF-16 is not the full range. Every piece of documentation I am looking at says that the current Unicode range is 000000 - 10FFFF and that all UTF encodings can encode all of these points. UTF-8 was originally even designed to use up to six octets, which means it could encode many more points if necessary.
That was rather nitpicky indeed. :)
@deceze: Looks like I'm nitpicking too much here. I'll change the answer; the planes I'm referring to are not defined yet. Currently only 21 bits are in use, which is safe for UTF-8. UTF-8 excludes surrogates but includes some non-character code points.
0

If you want to change the encoding of a string you can try

$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );
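Note that the call above relies on the default source encoding. A small sketch of why the third argument (the source encoding) matters: the same byte converts to different characters depending on what you declare it to be. The byte and the two encodings are chosen purely for illustration, and the mbstring extension is assumed.

```php
<?php
// The byte 0xA4 is a valid character in both encodings, but a different one in each.
$byte = "\xA4";

$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // U+00A4, currency sign
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15'); // U+20AC, euro sign

echo bin2hex($asLatin1), "\n"; // c2a4
echo bin2hex($asLatin9), "\n"; // e282ac
```

Neither result is wrong as far as the function is concerned; only knowledge of the real source encoding can tell you which one the user meant.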

5 Comments

Convert from what is the question though. If you don't know that, you can't reasonably and reliably convert anything.
If you don't know then you can use mb_detect_encoding() to find out. Though I've never had to detect encoding to force it to UTF-8, the third param of mb_convert_encoding is optional and not needed.
If you don't supply the third parameter, it just defaults to the internally set encoding. Auto-detecting an encoding is somewhere between very very tricky to impossible, at the very least it's not perfectly reliable. It's all just bits, and often a bit sequence is equally valid in many different encodings, so "auto-detecting" often just comes down to guessing.
Yes, if you don't supply the 3rd param it will default to the internal encoding. But I disagree when you say it is "very very tricky to impossible"; we do this all the time in our applications. Working with the DoD gives us the opportunity to deal with a wide range of different languages, currencies and encodings, since we obviously have troops all over the globe. We have never had a problem with this technique.
Then apparently you're not really dealing with a lot of ambiguous encodings: ideone.com/q2Skp
0

I found that the only thing that works for UTF-8 encoding is setting the following inside my config.php:

putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8');  // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");


-1

EDIT :

Is it possible to represent everything as UTF-8?

Yes, this is what you need to ensure:

  • html : headers/meta-header set to utf-8
  • all files saved as utf-8
  • database collation, tables and data encoding to utf-8

What can I use to effectively convert any character encoding to UTF-8

You can use utf8_encode before saving the input into your database (for a system set up mainly for Western European languages, the input will generally be ISO-8859-1 or a close relation; ref).

// eg
$name = utf8_encode($this->input->post('name'));

And as I mentioned before, you need to make sure the database collation, tables and data encoding are set to UTF-8. In CI, in your database connection config:

// Make sure have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
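A possible variation, not part of the answer above: MySQL's utf8 character set stores at most three bytes per character, so it cannot hold characters outside the Basic Multilingual Plane. Assuming MySQL 5.5.3 or later, where utf8mb4 is available, and assuming the tables and columns have also been converted to it, the same settings might look like this:

```php
// Assumption: MySQL 5.5.3+ and tables/columns already using utf8mb4.
$db['default']['char_set'] = 'utf8mb4';
$db['default']['dbcollat'] = 'utf8mb4_unicode_ci';
```

Whether this is worth the extra storage depends on whether you expect four-byte characters (for example emoji) in user input.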

2 Comments

utf8_encode only converts from latin-1 to UTF-8. If the user is not sending you latin-1, this function is useless. If the user is sending you latin-1, you can only support the 256 characters of the latin-1 encoding. If you can specify to the user to send you latin-1, you can as well specify that you want UTF-8 directly.
@deceze, thanks for reminding me not to oversimplify the question. I updated my answer in response to your downvote (yay). My previous answer did indeed oversimplify the question. Lazy is my virtue (lol) :)
