Is it possible to detect the character set of a user-supplied string?

If not, how about the next question:

Are there reliable built-in PHP functions that can accurately tell whether a user-supplied string (be it supplied through GET/POST/cookie etc.) is in UTF-8 or not? In other words, can I do something like

is_utf8($_GET['first_name'])

Is there any way this function could return TRUE when in reality first_name was not in UTF-8?

1 Answer

Regarding 1:

You can give mb_detect_encoding a try, but it's pretty much a shot in the dark. An "encoded" string is just a bunch of bytes, and such byte sequences are often equally valid in any number of different encodings. It is therefore by definition not possible to detect an unknown encoding reliably; you can only guess. For this reason, meta information such as HTTP headers exists to communicate the encoding of the transferred content. Check those if available.
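
To make the ambiguity concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The sample bytes and variable names are chosen purely for illustration. It reads the same two bytes once as UTF-8 and once as ISO-8859-1, and both readings are well-formed:

```php
<?php
// Two raw bytes, picked for illustration: 0xC3 0xA9.
$bytes = "\xC3\xA9";

// Read as UTF-8, they form a single character ("é").
$asUtf8Length = mb_strlen($bytes, 'UTF-8');

// Read as ISO-8859-1, they form two characters ("Ã" and "©").
$asLatin1Length = mb_strlen($bytes, 'ISO-8859-1');

// Both interpretations are valid, so the bytes alone cannot settle the question.
$validUtf8   = mb_check_encoding($bytes, 'UTF-8');
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');

var_dump($asUtf8Length, $asLatin1Length, $validUtf8, $validLatin1);
```

Any detector that is handed only these bytes has to pick one of the two readings without evidence, which is why it amounts to a guess.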

Regarding 2:

mb_check_encoding($var, 'UTF-8') will tell you whether the string is a valid UTF-8 string. As far as I've seen, in recent versions of PHP it does what it says on the tin. That still doesn't mean the string really is UTF-8; it only means the byte sequence is in an order that is valid in UTF-8.
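
As a rough sketch of how this check might be wrapped, assuming the mbstring extension is loaded: the helper name is_utf8 is borrowed from the question and is not a built-in PHP function, and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper (name taken from the question, not a PHP built-in):
// true if $value is a well-formed UTF-8 byte sequence, false otherwise.
function is_utf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

// A lone 0xE9 byte ("é" in ISO-8859-1) is not well-formed UTF-8 and is rejected.
$latin1Input = "Ren\xE9";
$rejected = !is_utf8($latin1Input);

// Plain ASCII is always well-formed UTF-8.
$asciiInput = "Rene";
$acceptedAscii = is_utf8($asciiInput);

// These bytes pass as well, even if the sender meant them as the
// ISO-8859-1 text "RenÃ©": validity does not prove intent.
$ambiguousInput = "Ren\xC3\xA9";
$acceptedAmbiguous = is_utf8($ambiguousInput);

var_dump($rejected, $acceptedAscii, $acceptedAmbiguous);
```

In a request handler, one option is to call such a helper on each value (for example $_GET['first_name']) and reject the request when it returns false. The helper can only rule out input that is definitely not UTF-8; it cannot confirm that the sender actually used UTF-8.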

3 Comments

How about assuming that the incoming string is in UTF-8 and doing a decode on it? If the results match, can we safely conclude that the assumption was right and it was indeed encoded in UTF-8? So in PHP, if I do the following and it comes back as TRUE, would that be a good way to verify that it was UTF-8? if ($_GET['name'] == utf8_decode($_GET['name']))
Are we saying that there is nothing in the PHP world as it stands today to verify whether user input is in UTF-8 or not? How am I going to make sure that I can escape/sanitize things properly for the user string at hand? My escape/sanitization filters are all designed to deal with UTF-8.
Text is just bytes; if you don't know which encoding these bytes represent, pretty much all bets are off. If you don't understand why this is, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. That's not just a PHP thing; it's a general issue regarding how text is represented by computers.
