
I've got the following double byte utf8 character:

\ud83d\ude04

(It's an iOS emoji.) I want to convert it to UTF-16:

U+1F604

How do I do this? I've tried the following:

$utf8_string = "\ud83d\ude04";
$utf16_string = mb_convert_encoding($utf8_string, 'UTF-16', 'UTF-8');

But I get the original string back. It doesn't get converted.

I'm thinking I may need to decode the string first. I've tried doing this with json_decode (which normally decodes these escape sequences quite nicely), but still no joy.

  • \u... is not UTF-8 and U+... is not UTF-16. The former looks like a JSON encoded representation of the character and the latter looks like a formal Unicode code point. Neither is a UTF encoding.

1 Answer


First off, let's get the terms right:

  • \ud83d\ude04 is a Unicode escape sequence as used in, for example, JavaScript. It is not "UTF-8".
  • It is also not "double byte", but rather a surrogate pair (see the sketch after this list).
  • U+1F604 is the official notation for a Unicode code point. It is not "UTF-16".
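
For illustration, here is a minimal sketch of the standard surrogate-pair arithmetic that turns the two halves back into a code point (the variable names are just for this example):

$hi = 0xD83D; // high (lead) surrogate, range U+D800–U+DBFF
$lo = 0xDE04; // low (trail) surrogate, range U+DC00–U+DFFF
$codepoint = 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
printf("U+%04X\n", $codepoint); // prints U+1F604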

The first step is to get from "\ud83d\ude04" to a UTF-8 encoded string. The easiest method is:

$utf8 = json_decode('"\ud83d\ude04"'); // note the added "" quotes
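
If you want to sanity-check that step, the raw bytes should be the four-byte UTF-8 encoding of U+1F604:

echo bin2hex($utf8); // f09f9884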

To convert from here to a UTF-16 encoded string, simply do:

$utf16 = iconv('UTF-8', 'UTF-16', $utf8);

However, the result is not "U+1F604" but rather a UTF-16 encoded string (the hex representation of which is feffd83dde04).
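
The feff prefix is a byte order mark, which iconv (at least the common glibc implementation) emits when you don't specify an endianness. Requesting an explicit byte order should skip the BOM:

$utf16be = iconv('UTF-8', 'UTF-16BE', $utf8);
echo bin2hex($utf16be); // d83dde04 — just the surrogate pair, no BOM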

To get the Unicode code point notation, the easiest way is probably to convert to UCS-4, trim the leading zeros, and pad back to at least four digits:

$ucs4      = iconv('UTF-8', 'UCS-4', $utf8);                            // four bytes per code point, here 0001f604
$codepoint = sprintf('U+%04s', ltrim(strtoupper(bin2hex($ucs4)), '0')); // U+1F604
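
As an aside, on PHP 7.2 or later (assuming the mbstring extension is available), mb_ord gets you the code point directly as an integer:

$codepoint = sprintf('U+%04X', mb_ord($utf8, 'UTF-8'));
echo $codepoint; // U+1F604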