PHP: Convert unicode codepoint to UTF-8

Question

I have my data in this format: U+597D or like this U+6211. I want to convert them to UTF-8 (original characters are 好 and 我). How can I do it?

Is your original data UTF-16 code units, or Unicode code points?
– Thanatos, Commented Nov 26, 2009 at 22:02

Mez · Accepted Answer · 2012-11-23 10:58:59Z

47

$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');

is probably the simplest solution.
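As the comments on this answer point out, four hex digits only reach U+FFFF. Below is a minimal sketch of the same entity-based idea widened to longer code points. The function name `decode_u_plus` is made up for illustration, and the `{4,6}` quantifier and the `i` flag are assumptions about what the input may contain. Note that html_entity_decode() will also decode any entities that were already in the input.

```php
<?php
// Hypothetical helper, not part of the original answer.
function decode_u_plus($text)
{
    // Turn "U+" followed by 4 to 6 hex digits into a hexadecimal HTML entity...
    $entities = preg_replace('/U\+([0-9A-F]{4,6})/i', '&#x$1;', $text);
    // ...then let html_entity_decode() produce the UTF-8 bytes.
    return html_entity_decode($entities, ENT_NOQUOTES, 'UTF-8');
}

echo decode_u_plus('U+597DU+6211'), "\n";
echo decode_u_plus('U+1F600'), "\n";
```

If the page still shows mojibake, the output encoding declared to the browser is the next thing to check, as discussed in the comments.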

edited Nov 23, 2012 at 10:58

answered Nov 26, 2009 at 21:54

Mez


7 Comments

Dor Over a year ago

That results in HTML entity, not a UTF8 character :)

Mez Over a year ago

Not in my tests it doesn't. It converts the code as shown in the Q to a HTML entity... THEN decodes the html entity.

Thanatos Over a year ago

Your regex won't match all code points - you need {4,5} to match characters higher than U+FFFF.

Anthony Over a year ago

No, the problem is that my browser shows "ɕD;" and "&#597D;" in the html-source of the page, while it's supposed to show "好"

Thanatos Over a year ago

Is your browser using the correct character-encoding? You'll probably have to specify it, either using a meta tag, or by sending it in the HTTP-headers. On Firefox, I can go View -> Character Encoding to both view & change the current encoding that FF is using.


velcrow · Accepted Answer · 2011-08-22 20:21:07Z

function utf8($num)
{
    if($num<=0x7F)       return chr($num);
    if($num<=0x7FF)      return chr(($num>>6)+192).chr(($num&63)+128);
    if($num<=0xFFFF)     return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
    if($num<=0x1FFFFF)   return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
    return '';
}

function uniord($c)
{
    // Square brackets are used for string offsets; the curly-brace form $c{0} was removed in PHP 8.
    $ord0 = ord($c[0]); if ($ord0>=0   && $ord0<=127) return $ord0;
    $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
    $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
    $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
    return false;
}

utf8() and uniord() try to mirror PHP's chr() and ord() functions:

echo utf8(0x6211)."\n";
echo uniord(utf8(0x6211))."\n";
echo "U+".dechex(uniord(utf8(0x6211)))."\n";

//In your case:
$wo='U+6211';
$hao='U+597D';
echo utf8(hexdec(str_replace("U+","", $wo)))."\n";
echo utf8(hexdec(str_replace("U+","", $hao)))."\n";

output:

我
25105
U+6211
我
好

Rabin Lama Dong · Accepted Answer · 2017-07-06 11:35:14Z

15

PHP 7+

As of PHP 7, you can use the Unicode codepoint escape syntax to do this.

echo "\u{597D}"; outputs 好.

answered Jul 6, 2017 at 11:35

Rabin Lama Dong


2 Comments

Alex Chiang Over a year ago

Is there a simple way to convert them? Sometimes the code just has to handle a string from a client request (something like "\u597D").

AnnoyinC Over a year ago

This does not work when the "codepoint" string is variable. $pt = 28ff; echo "\u{$pt}"; >>> "\u28ff"
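For the two cases raised in these comments, here is a hedged sketch. The `\u{...}` syntax is resolved when the literal is parsed, so it cannot take a variable. mb_chr() (built in as of PHP 7.2 with the mbstring extension) accepts the code point as an integer, and json_decode() can decode a JSON-style `\uXXXX` escape that arrives as text. Both function names below are hypothetical.

```php
<?php
// Sketch only; both function names are made up for illustration.

// Hex digits held in a variable, e.g. "28ff" -> the UTF-8 character.
function char_from_hex($hex)
{
    return mb_chr(hexdec($hex), 'UTF-8');
}

// A JSON-style escape received as text: backslash, "u", four hex digits.
// Assumes the input contains no unescaped double quotes.
function char_from_json_escape($escaped)
{
    return json_decode('"' . $escaped . '"');
}

echo char_from_hex('597D'), "\n";
echo char_from_json_escape('\u597D'), "\n";
```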

John Slegers · Accepted Answer · 2016-03-27 19:34:56Z

I just wrote a polyfill for missing multibyte versions of ord and chr with the following in mind:

It defines functions mb_ord and mb_chr only if they don't already exist. If they do exist in your framework or some future version of PHP, the polyfill will be ignored.
It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it will use the iconv extension instead.

I also added functions for HTML entity encoding / decoding and for encoding / decoding to JSON format, as well as some demo code showing how to use these functions.

Code

if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}

if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}

if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
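        // Fixed: test the $encoding parameter, and pass the setting name that iconv_set_encoding() requires.
        return ($encoding === NULL) ? iconv_get_encoding('internal_encoding') : iconv_set_encoding('internal_encoding', $encoding);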
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}

if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}

How to use

echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));

echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('我好', false));
var_dump(mb_html_entity_decode('&#25105;&#22909;'));

echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('我好'));
var_dump(mb_html_entity_decode('&#x6211;&#x597D;'));

echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));

Output

Get string from numeric DEC value
string(3) "我"
string(3) "好"

Get string from numeric HEX value
string(3) "我"
string(3) "好"

Get numeric value of character as DEC int
int(25105)
int(22909)

Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"

Encode / decode to DEC based HTML entities
string(16) "&#25105;&#22909;"
string(6) "我好"

Encode / decode to HEX based HTML entities
string(16) "&#x6211;&#x597D;"
string(6) "我好"

Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"

If the string is a complete JSON document, e.g. $str='{"a":"\u51fa\u884c"}';, then var_dump(codepoint_decode($str)); outputs NULL, because wrapping it in another pair of double quotes no longer yields valid JSON.
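When the input is already a whole JSON document like the one in that comment, a minimal sketch is to hand it straight to json_decode() instead of codepoint_decode(); the variable names are only for illustration.

```php
<?php
$str = '{"a":"\u51fa\u884c"}';

// Decode the document as JSON; the \uXXXX escapes become UTF-8 text.
$data = json_decode($str, true);
echo $data['a'], "\n";

// Re-encode without escaping non-ASCII characters.
echo json_encode($data, JSON_UNESCAPED_UNICODE), "\n";
```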

eleg · Accepted Answer · 2012-08-09 13:24:29Z

2

mb_convert_encoding(
    preg_replace("/U\+([0-9A-F]*)/"
        ,"&#x\\1;"
        ,'U+597DU+6211'
    )
    ,"UTF-8"
    ,"HTML-ENTITIES"
);

works fine, too.
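One caveat: the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated as of PHP 8.2. A possible alternative, sketched here with a hypothetical helper name, is to convert each match directly with mb_chr() (mbstring extension, PHP 7.2+). The `{4,6}` quantifier is an assumption about the input.

```php
<?php
// Hypothetical helper; not from the original answer.
function u_plus_to_utf8($text)
{
    return preg_replace_callback(
        '/U\+([0-9A-F]{4,6})/i',
        function ($m) {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for invalid code points; keep the original text in that case.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

echo u_plus_to_utf8('U+597DU+6211'), "\n";
```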

answered Aug 9, 2012 at 13:24

eleg


Php'Regex · Accepted Answer · 2017-03-05 10:27:00Z

2

<?php

function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}

$your_input='U+597D';

echo (chr_utf8(hexdec(ltrim($your_input,'U+'))));

// Output 好

If you want to use a callback function, you can try this:

<?php

// Note: function chr_utf8 shown above is required

$your_input='U+597DU+6211';

$result=preg_replace_callback('#U\+([a-f0-9]+)#i',function($a){return chr_utf8(hexdec($a[1]));},$your_input);

echo $result;

// Output 好我

Check it in https://eval.in/748187

edited Mar 5, 2017 at 10:27

answered Mar 5, 2017 at 10:21

Php'Regex


Tschallacka · Accepted Answer · 2015-12-02 15:40:36Z

I needed to filter specific characters without affecting the HTML, because I was using a WYSIWYG editor and people copy-pasting from Word would add some nice unrenderable characters to the content.

My solution boils down to simple replacement lists.

class ReplaceIllegal {
    // Filled by init() with the bytes 0x00-0x1E and 0x80-0xFE and their decimal HTML entities.
    public static $find = array();
    private static $replace = array();
    private static $ready = false;

    /*
     * Build the default lists once. chr() yields the actual byte; a single-quoted
     * '\x0' would be the literal text \x0 rather than a control character.
     */
    private static function init() {
        if (self::$ready) {
            return;
        }
        self::$ready = true;
        foreach (array_merge(range(0x00, 0x1E), range(0x80, 0xFE)) as $byte) {
            self::$find[] = chr($byte);
            self::$replace[] = '&#' . $byte . ';';
        }
    }

    /*
     * replace illegal characters with escaped html characters but don't touch anything else.
     * Works byte by byte, so it assumes single-byte input (UTF-8 sequences would be split).
     */
    public static function getSaveValue($value) {
        self::init();
        return str_replace(self::$find, self::$replace, $value);
    }
    public static function makeIllegal($find,$replace) {
        self::init();
        self::$find[] = $find;
        self::$replace[] = $replace;
    }
}

Claudio Garaycochea · Accepted Answer · 2018-04-04 03:30:12Z

1

This worked fine for me. If you have a string like "Letters u00e1 u00e9 etc.", it is rewritten as "Letters &#x00e1; &#x00e9; etc.", which a browser displays as "Letters á é etc.".

function unicode2html($str){
    // Set the locale to something that's UTF-8 capable
    setlocale(LC_ALL, 'en_US.UTF-8');
    // Convert the codepoints to entities
    $str = preg_replace("/u([0-9a-fA-F]{4})/", "&#x\\1;", $str);
    // Convert the rest of the string from UTF-8 to ISO-8859-1; the entities are plain ASCII and pass through unchanged
    return iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
}

answered Apr 4, 2018 at 3:30

Claudio Garaycochea


Dor · Accepted Answer · 2009-11-26 21:54:47Z

0

With the aid of the following table:

http://en.wikipedia.org/wiki/UTF-8#Description

can't be simpler :)

Simply mask the unicode numbers according to which range they fit in.
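To make the masking concrete, here is a small worked sketch for one code point from the question. U+597D falls in the U+0800..U+FFFF range, which uses the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx; the variable names are only for illustration.

```php
<?php
$cp = 0x597D;

$b1 = 0xE0 | ($cp >> 12);          // 1110xxxx: top 4 bits
$b2 = 0x80 | (($cp >> 6) & 0x3F);  // 10xxxxxx: middle 6 bits
$b3 = 0x80 | ($cp & 0x3F);         // 10xxxxxx: low 6 bits

printf("%02X %02X %02X\n", $b1, $b2, $b3);   // E5 A5 BD
echo chr($b1) . chr($b2) . chr($b3), "\n";
```

The other ranges work the same way with a different lead byte and number of continuation bytes.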

answered Nov 26, 2009 at 21:54

Dor
