If you require all strings that have associated named entities to be translated, use htmlentities() instead. That function is identical to htmlspecialchars() in all ways, except that with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities.
But even htmlentities() does not encode all Unicode characters. It encodes what it can (all of Latin-1), and the others slip through (e.g. Љ).
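A quick way to see the gap, as a minimal sketch: it assumes the script is saved as UTF-8 and that htmlentities() is called without a document-type flag, so the default HTML 4.01 entity table applies. The sample string is made up for illustration.

```php
<?php
// Sample input (an assumption for illustration): one Latin-1 letter,
// one HTML-special character and one Cyrillic letter.
$input = "é < Љ";

// ENT_QUOTES only; no ENT_HTML5 flag, so the HTML 4.01 table is used.
$encoded = htmlentities($input, ENT_QUOTES, 'UTF-8');

// "é" and "<" come back as named entities, "Љ" comes back unchanged.
echo $encoded, "\n";
```

The Cyrillic letter has no named entity in that table, so it is passed through as raw UTF-8.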
The function below consults the ASCII table, so you can customise which characters to encode and which to leave alone.
(Note: it is certainly not fast.)
/**
* Unicode-proof htmlentities.
* Returns 'normal' chars as chars and weirdos as numeric HTML entities.
* @param string $str input string
* @return string encoded output
*/
function superentities( $str ){
    // decode existing entities first, otherwise they would be double-escaped
    $str = html_entity_decode(stripslashes($str), ENT_QUOTES, 'UTF-8');
    $ar = preg_split('/(?<!^)(?!$)/u', $str); // split into an array of single (possibly multi-byte) characters
    $str2 = ''; // output buffer; must be initialised before appending
    foreach ($ar as $c){
        $o = ord($c);
        if ( (strlen($c) > 1) || /* multi-byte [unicode] */
            ($o < 32 || $o > 126) || /* <- control / latin weirdos -> */
            ($o > 33 && $o < 40) || /* quotes + ampersand (also # $ %) */
            ($o > 59 && $o < 63) /* html: < = > */
        ) {
            // convert to numeric entity; the range covers all of Unicode, not just the BMP
            $c = mb_encode_numericentity($c, array(0x0, 0x10ffff, 0, 0x1fffff), 'UTF-8');
        }
        $str2 .= $c;
    }
    return $str2;
}
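The array handed to mb_encode_numericentity() is a conversion map of four integers: start code point, end code point, offset and mask. The sketch below only illustrates how the start of the range decides whether plain ASCII gets encoded. It needs the mbstring extension, and the sample string and variable names are made up for the example.

```php
<?php
// Sample string (an assumption): one ASCII letter and one Cyrillic letter.
$sample = 'aЉ';

// Range starts at 0x0: every character is encoded, ASCII included.
$all = mb_encode_numericentity($sample, [0x0, 0x10ffff, 0, 0x1fffff], 'UTF-8');

// Range starts at 0x80: ASCII is left alone, only non-ASCII is encoded.
$nonAscii = mb_encode_numericentity($sample, [0x80, 0x10ffff, 0, 0x1fffff], 'UTF-8');

echo $all, "\n";      // &#97;&#1033;
echo $nonAscii, "\n"; // a&#1033;
```

Starting the range at 0x0 is what lets superentities() reuse the same call for quotes, ampersands and angle brackets as well as for multi-byte characters.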