21

I am using utf8-charset tables on a MySQL 5.1 server, which does not support the utf8mb4 encoding. When I insert 4-byte UTF-8 characters such as "𡃁","𨋢","𠵱","𥄫","𠽌","唧","𠱁", MySQL either raises an error or drops the text that follows the character.

How can I programmatically detect 4-byte encoded utf8 characters in PHP and replace them?

9
  • Pretty simple: split a string by characters (many ways to do so) and check if strlen($char) == 4. Not sure if this is really the correct way to detect the characters MySQL can't handle though, going by code point may be more accurate. Commented May 11, 2013 at 11:26
  • Have you checked out the multibyte extension? Also, be sure to always read the comments. Commented May 11, 2013 at 11:30
  • @deceze That's an approach. I will go for that if there aren't other elegant ways. Commented May 11, 2013 at 11:36
  • See this related question; I know it's Python, but you could use a regex to check for 4-byte characters. Commented May 11, 2013 at 11:36
  • @cbuckley do you know whether \U is also valid in PHP? Commented May 11, 2013 at 11:50

2 Answers

19

The following regular expression will replace 4-byte UTF-8 characters:

function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);    
}

var_dump(replace4byte('d'), replace4byte('d𡃁d'));

This doesn't rely on the /u modifier, so you shouldn't need to worry about UTF-8 for PCRE being compiled in. However, if you have that support, deceze's preg_replace_callback is neater.
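If you are unsure whether the /u modifier is available, one rough way to check is to try a trivial pattern with it. This is a minimal sketch, assuming that preg_match() returns false when PCRE was built without UTF-8 support; the function name pcreSupportsUtf8 is made up for illustration.

```php
<?php
// Hypothetical helper: reports whether this PHP build's PCRE accepts /u.
// Assumption: without UTF-8 support the pattern fails to compile, so
// preg_match() returns false (the warning is suppressed with @).
function pcreSupportsUtf8(): bool
{
    return @preg_match('/./u', 'a') === 1;
}

var_dump(pcreSupportsUtf8());
```

If it returns false, the byte-level regex above is the safer choice.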

(Regex adapted from Ensuring valid utf-8 in PHP)

17

This should work:

if (max(array_map('ord', str_split($string))) >= 240) {
    // $string contains at least one 4-byte UTF-8 sequence
}

The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the three-byte form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. Every other byte in valid UTF-8 is below 240, so if the string contains any such byte, it is an indicator for a 4-byte sequence.
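The byte check can be wrapped in a small function. This is a sketch, assuming the input is valid UTF-8; the name hasFourByteChar is hypothetical, and the empty-string guard is an addition because max() cannot be relied on for an empty array.

```php
<?php
// Hypothetical helper: true if any byte is >= 0xF0 (240), which in valid
// UTF-8 only occurs as the lead byte of a 4-byte sequence.
function hasFourByteChar(string $string): bool
{
    // Guard added here: avoid calling max() on an empty byte list.
    if ($string === '') {
        return false;
    }
    return max(array_map('ord', str_split($string))) >= 0xF0;
}

var_dump(hasFourByteChar('abc'));   // bool(false)
var_dump(hasFourByteChar('唧'));    // bool(false), U+5527 needs only 3 bytes
var_dump(hasFourByteChar('d𡃁d'));  // bool(true)
```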

If you want to remove long characters, this will do:

preg_replace_callback('/./u', function (array $match) {
    // Drop characters encoded in 4 bytes; keep everything else unchanged.
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);

Though there may be a more elegant regex way to express high codepoints directly.
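One way to express the high code points directly is a character class over U+10000 to U+10FFFF with the /u modifier. This is a sketch rather than a definitive implementation: it assumes PCRE has UTF-8 support and that the input is valid UTF-8. The function name replaceSupplementary and the sample sentence are illustrative.

```php
<?php
// Hypothetical helper: replaces every code point above U+FFFF, i.e. every
// character that UTF-8 encodes in 4 bytes.
function replaceSupplementary(string $string, string $replacement = ''): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
    // preg_replace() returns null on error (for example, invalid UTF-8 input);
    // in that case this sketch hands back the input unchanged.
    return $result ?? $string;
}

$a = 'omg, I cannot insert 𡃁 into my table, blahblahblah';
echo replaceSupplementary($a, 'MYTEXT'), "\n";
// omg, I cannot insert MYTEXT into my table, blahblahblah
```

Falling back to the original string on error is a design choice for brevity; code that must guarantee clean output may prefer to throw instead.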

1 Comment

Thanks for detection but can you finish it with a replacement example too? $a = "omg, I cannot insert 𡃁 into my table, blahblahblah"; //target $a == "omg, I cannot insert MYTEXT into my table, blahblahblah";
