I have my data in this format: U+597D or like this U+6211. I want to convert them to UTF-8 (original characters are 好 and 我). How can I do it?
Is your original data UTF-16 code units, or Unicode code points? – Thanatos, Nov 26, 2009 at 22:02
They are Unicode code points from the Unihan database. – Anthony, Nov 26, 2009 at 22:03
9 Answers
// {4,6} and the /i flag also cover code points beyond U+FFFF and lowercase hex digits
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4,6})/i", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');
is probably the simplest solution.
function utf8($num)
{
if($num<=0x7F) return chr($num);
if($num<=0x7FF) return chr(($num>>6)+192).chr(($num&63)+128);
if($num<=0xFFFF) return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<=0x1FFFFF) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
return '';
}
function uniord($c)
{
// Square brackets replace the $c{0} offset syntax, which was removed in PHP 8.
$ord0 = ord($c[0]); if ($ord0>=0 && $ord0<=127) return $ord0;
$ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
$ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
$ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
return false;
}
utf8() and uniord() try to mirror PHP's chr() and ord() functions:
echo utf8(0x6211)."\n";
echo uniord(utf8(0x6211))."\n";
echo "U+".dechex(uniord(utf8(0x6211)))."\n";
//In your case:
$wo='U+6211';
$hao='U+597D';
echo utf8(hexdec(str_replace("U+","", $wo)))."\n";
echo utf8(hexdec(str_replace("U+","", $hao)))."\n";
output:
我
25105
U+6211
我
好
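Note that utf8() above accepts any number up to 0x1FFFFF, although Unicode ends at U+10FFFF and the surrogate range U+D800..U+DFFF has no UTF-8 encoding. A minimal sketch of a check that could run before encoding, assuming PHP 7.1+ for the nullable return type; is_scalar_value() and parse_codepoint() are hypothetical helper names, not part of the answer:

```php
<?php
// Hypothetical helper: true only for Unicode scalar values,
// i.e. U+0000..U+10FFFF without the surrogates U+D800..U+DFFF.
function is_scalar_value(int $num): bool
{
    return $num >= 0 && $num <= 0x10FFFF && !($num >= 0xD800 && $num <= 0xDFFF);
}

// Hypothetical helper: parse a "U+XXXX" token into an integer,
// or return null if the token is malformed or out of range.
function parse_codepoint(string $token): ?int
{
    if (!preg_match('/^U\+([0-9A-Fa-f]{4,6})$/', $token, $m)) {
        return null;
    }
    $num = hexdec($m[1]);
    return is_scalar_value($num) ? $num : null;
}

var_dump(parse_codepoint('U+597D'));   // int(22909)
var_dump(parse_codepoint('U+D800'));   // NULL (surrogate)
var_dump(parse_codepoint('U+110000')); // NULL (beyond U+10FFFF)
```

The integer returned for a valid token can then be passed to utf8().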
PHP 7+
As of PHP 7, you can use the Unicode codepoint escape syntax to do this.
echo "\u{597D}"; outputs 好.
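The escape syntax only applies to string literals in source code, so it does not help when the "U+597D" text arrives as data at run time. For that case, a possible sketch using mb_chr(), assuming PHP 7.2+ with the mbstring extension loaded; codepoints_to_utf8() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: replace every "U+XXXX" token in $text with the
// UTF-8 character for that code point. Assumes PHP 7.2+ and mbstring.
function codepoints_to_utf8(string $text): string
{
    return preg_replace_callback(
        '/U\+([0-9A-F]{4,6})/i',
        function (array $m): string {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for an invalid code point;
            // in that case the original token is left untouched.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

echo codepoints_to_utf8('U+597D U+6211'), "\n"; // 好 我
```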
I just wrote a polyfill for missing multibyte versions of ord and chr with the following in mind:
- It defines the functions mb_ord and mb_chr only if they don't already exist. If they do exist in your framework or some future version of PHP, the polyfill will be ignored.
- It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it will use the iconv extension instead.
I also added functions for HTML entity encoding / decoding and for encoding / decoding to JSON format, as well as some demo code showing how to use these functions.
Code
if (!function_exists('codepoint_encode')) {
function codepoint_encode($str) {
return substr(json_encode($str), 1, -1);
}
}
if (!function_exists('codepoint_decode')) {
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
}
if (!function_exists('mb_internal_encoding')) {
function mb_internal_encoding($encoding = NULL) {
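// Use the function's own $encoding parameter; iconv_get_encoding()/iconv_set_encoding() take the setting name
return ($encoding === NULL) ? iconv_get_encoding('internal_encoding') : iconv_set_encoding('internal_encoding', $encoding);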
}
}
if (!function_exists('mb_convert_encoding')) {
function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
}
}
if (!function_exists('mb_chr')) {
function mb_chr($ord, $encoding = 'UTF-8') {
if ($encoding === 'UCS-4BE') {
return pack("N", $ord);
} else {
return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
}
}
}
if (!function_exists('mb_ord')) {
function mb_ord($char, $encoding = 'UTF-8') {
if ($encoding === 'UCS-4BE') {
list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
return $ord;
} else {
return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
}
}
}
if (!function_exists('mb_htmlentities')) {
function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
}, $string);
}
}
if (!function_exists('mb_html_entity_decode')) {
function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
}
}
How to use
echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('我好', false));
var_dump(mb_html_entity_decode('&#25105;&#22909;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('我好'));
var_dump(mb_html_entity_decode('&#x6211;&#x597D;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));
Output
Get string from numeric DEC value
string(3) "我"
string(3) "好"
Get string from numeric HEX value
string(3) "我"
string(3) "好"
Get numeric value of character as DEC int
int(25105)
int(22909)
Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"
Encode / decode to DEC based HTML entities
string(16) "&#25105;&#22909;"
string(6) "我好"
Encode / decode to HEX based HTML entities
string(16) "&#x6211;&#x597D;"
string(6) "我好"
Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
1 Comment
With $str='{"a":"\u51fa\u884c"}';, the call var_dump(codepoint_decode($str)); outputs NULL.
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
$your_input='U+597D';
echo (chr_utf8(hexdec(ltrim($your_input,'U+'))));
// Output 好
If you want to use a callback function, you can try this:
<?php
// Note: function chr_utf8 shown above is required
$your_input='U+597DU+6211';
$result=preg_replace_callback('#U\+([a-f0-9]+)#i',function($a){return chr_utf8(hexdec($a[1]));},$your_input);
echo $result;
// Output 好我
Check it in https://eval.in/748187
I needed to filter specific characters without affecting the HTML, because I was using a WYSIWYG editor and people copy-pasting from Word would add some nice unrenderable characters to the content.
My solution boils down to simple replacement lists.
class ReplaceIllegal {
public static $find = array ( 0 => '\x0', 1 => '\x1', 2 => '\x2', 3 => '\x3', 4 => '\x4', 5 => '\x5', 6 => '\x6', 7 => '\x7', 8 => '\x8', 9 => '\x9', 10 => '\xA', 11 => '\xB', 12 => '\xC', 13 => '\xD', 14 => '\xE', 15 => '\xF', 16 => '\x10', 17 => '\x11', 18 => '\x12', 19 => '\x13', 20 => '\x14', 21 => '\x15', 22 => '\x16', 23 => '\x17', 24 => '\x18', 25 => '\x19', 26 => '\x1A', 27 => '\x1B', 28 => '\x1C', 29 => '\x1D', 30 => '\x1E', 31 => '\x80', 32 => '\x81', 33 => '\x82', 34 => '\x83', 35 => '\x84', 36 => '\x85', 37 => '\x86', 38 => '\x87', 39 => '\x88', 40 => '\x89', 41 => '\x8A', 42 => '\x8B', 43 => '\x8C', 44 => '\x8D', 45 => '\x8E', 46 => '\x8F', 47 => '\x90', 48 => '\x91', 49 => '\x92', 50 => '\x93', 51 => '\x94', 52 => '\x95', 53 => '\x96', 54 => '\x97', 55 => '\x98', 56 => '\x99', 57 => '\x9A', 58 => '\x9B', 59 => '\x9C', 60 => '\x9D', 61 => '\x9E', 62 => '\x9F', 63 => '\xA0', 64 => '\xA1', 65 => '\xA2', 66 => '\xA3', 67 => '\xA4', 68 => '\xA5', 69 => '\xA6', 70 => '\xA7', 71 => '\xA8', 72 => '\xA9', 73 => '\xAA', 74 => '\xAB', 75 => '\xAC', 76 => '\xAD', 77 => '\xAE', 78 => '\xAF', 79 => '\xB0', 80 => '\xB1', 81 => '\xB2', 82 => '\xB3', 83 => '\xB4', 84 => '\xB5', 85 => '\xB6', 86 => '\xB7', 87 => '\xB8', 88 => '\xB9', 89 => '\xBA', 90 => '\xBB', 91 => '\xBC', 92 => '\xBD', 93 => '\xBE', 94 => '\xBF', 95 => '\xC0', 96 => '\xC1', 97 => '\xC2', 98 => '\xC3', 99 => '\xC4', 100 => '\xC5', 101 => '\xC6', 102 => '\xC7', 103 => '\xC8', 104 => '\xC9', 105 => '\xCA', 106 => '\xCB', 107 => '\xCC', 108 => '\xCD', 109 => '\xCE', 110 => '\xCF', 111 => '\xD0', 112 => '\xD1', 113 => '\xD2', 114 => '\xD3', 115 => '\xD4', 116 => '\xD5', 117 => '\xD6', 118 => '\xD7', 119 => '\xD8', 120 => '\xD9', 121 => '\xDA', 122 => '\xDB', 123 => '\xDC', 124 => '\xDD', 125 => '\xDE', 126 => '\xDF', 127 => '\xE0', 128 => '\xE1', 129 => '\xE2', 130 => '\xE3', 131 => '\xE4', 132 => '\xE5', 133 => '\xE6', 134 => '\xE7', 135 => '\xE8', 136 => '\xE9', 137 => '\xEA', 138 => '\xEB', 139 => 
'\xEC', 140 => '\xED', 141 => '\xEE', 142 => '\xEF', 143 => '\xF0', 144 => '\xF1', 145 => '\xF2', 146 => '\xF3', 147 => '\xF4', 148 => '\xF5', 149 => '\xF6', 150 => '\xF7', 151 => '\xF8', 152 => '\xF9', 153 => '\xFA', 154 => '\xFB', 155 => '\xFC', 156 => '\xFD', 157 => '\xFE', );
private static $replace = array ( 0 => '�', 1 => '', 2 => '', 3 => '', 4 => '', 5 => '', 6 => '', 7 => '', 8 => '', 9 => '	', 10 => ' ', 11 => '', 12 => '', 13 => ' ', 14 => '', 15 => '', 16 => '', 17 => '', 18 => '', 19 => '', 20 => '', 21 => '', 22 => '', 23 => '', 24 => '', 25 => '', 26 => '', 27 => '', 28 => '', 29 => '', 30 => '', 31 => '€', 32 => '', 33 => '‚', 34 => 'ƒ', 35 => '„', 36 => '…', 37 => '†', 38 => '‡', 39 => 'ˆ', 40 => '‰', 41 => 'Š', 42 => '‹', 43 => 'Œ', 44 => '', 45 => 'Ž', 46 => '', 47 => '', 48 => '‘', 49 => '’', 50 => '“', 51 => '”', 52 => '•', 53 => '–', 54 => '—', 55 => '˜', 56 => '™', 57 => 'š', 58 => '›', 59 => 'œ', 60 => '', 61 => 'ž', 62 => 'Ÿ', 63 => ' ', 64 => '¡', 65 => '¢', 66 => '£', 67 => '¤', 68 => '¥', 69 => '¦', 70 => '§', 71 => '¨', 72 => '©', 73 => 'ª', 74 => '«', 75 => '¬', 76 => '­', 77 => '®', 78 => '¯', 79 => '°', 80 => '±', 81 => '²', 82 => '³', 83 => '´', 84 => 'µ', 85 => '¶', 86 => '·', 87 => '¸', 88 => '¹', 89 => 'º', 90 => '»', 91 => '¼', 92 => '½', 93 => '¾', 94 => '¿', 95 => 'À', 96 => 'Á', 97 => 'Â', 98 => 'Ã', 99 => 'Ä', 100 => 'Å', 101 => 'Æ', 102 => 'Ç', 103 => 'È', 104 => 'É', 105 => 'Ê', 106 => 'Ë', 107 => 'Ì', 108 => 'Í', 109 => 'Î', 110 => 'Ï', 111 => 'Ð', 112 => 'Ñ', 113 => 'Ò', 114 => 'Ó', 115 => 'Ô', 116 => 'Õ', 117 => 'Ö', 118 => '×', 119 => 'Ø', 120 => 'Ù', 121 => 'Ú', 122 => 'Û', 123 => 'Ü', 124 => 'Ý', 125 => 'Þ', 126 => 'ß', 127 => 'à', 128 => 'á', 129 => 'â', 130 => 'ã', 131 => 'ä', 132 => 'å', 133 => 'æ', 134 => 'ç', 135 => 'è', 136 => 'é', 137 => 'ê', 138 => 'ë', 139 => 'ì', 140 => 'í', 141 => 'î', 142 => 'ï', 143 => 'ð', 144 => 'ñ', 145 => 'ò', 146 => 'ó', 147 => 'ô', 148 => 'õ', 149 => 'ö', 150 => '÷', 151 => 'ø', 152 => 'ù', 153 => 'ú', 154 => 'û', 155 => 'ü', 156 => 'ý', 157 => 'þ', );
/*
* Replace illegal characters with their escaped HTML equivalents, but don't touch anything else.
*/
public static function getSaveValue($value) {
return str_replace(self::$find, self::$replace, $value);
}
public static function makeIllegal($find,$replace) {
self::$find[] = $find;
self::$replace[] = $replace;
}
}
This worked fine for me. A string such as "Letters u00e1 u00e9 etc." is displayed as "Letters á é etc." once the browser renders the generated entities.
function unicode2html($str){
// Set the locale to something that's UTF-8 capable
setlocale(LC_ALL, 'en_US.UTF-8');
// Convert the codepoints to entities
$str = preg_replace("/u([0-9a-fA-F]{4})/", "&#x\\1;", $str);
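// The entities are plain ASCII, so they pass through this conversion unchanged;
// any other characters in the input are transliterated to ISO-8859-1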
return iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
}
With the aid of the following table:
http://en.wikipedia.org/wiki/UTF-8#Description
it can't be simpler :)
Simply mask the Unicode code points according to the range they fall in.
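As a worked example of that masking, here is a small sketch for the three-byte range (U+0800..U+FFFF), whose layout in the table is 1110xxxx 10xxxxxx 10xxxxxx. The variable names are only illustrative:

```php
<?php
// Code point of 好, which falls in the three-byte range U+0800..U+FFFF.
$cp = 0x597D;

$b1 = 0xE0 | ($cp >> 12);          // 1110xxxx: top 4 bits
$b2 = 0x80 | (($cp >> 6) & 0x3F);  // 10xxxxxx: middle 6 bits
$b3 = 0x80 | ($cp & 0x3F);         // 10xxxxxx: low 6 bits

printf("%02X %02X %02X\n", $b1, $b2, $b3); // E5 A5 BD
echo chr($b1) . chr($b2) . chr($b3), "\n"; // 好
```

The other ranges work the same way, with a different lead-byte prefix and one fewer or one more continuation byte.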