Can PHP detect 4-byte encoded UTF-8 chars?

Question

I am using utf8-charset tables on a MySQL 5.1 server, which does not support the utf8mb4 encoding in tables. When I insert 4-byte encoded UTF-8 characters like "𡃁", "𨋢", "𠵱", "𥄫", "𠽌", "唧", "𠱁", the table either raises an error or drops the text that follows the character.

How can I programmatically detect 4-byte encoded utf8 characters in PHP and replace them?

Pretty simple: split the string into characters (there are many ways to do so) and check whether strlen($char) == 4. I'm not sure this is really the correct way to detect the characters MySQL can't handle, though; going by code point may be more accurate.
– deceze ♦, Commented May 11, 2013 at 11:26
Have you checked out the multibyte extension? Also, be sure to always read the comments.
– Sverri M. Olsen, Commented May 11, 2013 at 11:30
@deceze That's one approach. I will go for that if there isn't a more elegant way.
– Abby Chau Yu Hoi, Commented May 11, 2013 at 11:36
See this related question; I know it's Python, but you could use a regex to check for 4-byte characters.
– cmbuckley, Commented May 11, 2013 at 11:36

cmbuckley · Accepted Answer · 2018-02-09 13:39:04Z

19

The following regular expression will replace 4-byte UTF-8 characters:

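// Replaces every well-formed 4-byte UTF-8 sequence (code points U+10000 to
// U+10FFFF) with $replacement. The pattern is byte-oriented, so no /u modifier.
function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);
}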

var_dump(replace4byte('d'), replace4byte('d𡃁d'));

This doesn't rely on the /u modifier, so it works even if PCRE was compiled without UTF-8 support. If you do have that support, deceze's preg_replace_callback approach is neater.

(Regex adapted from Ensuring valid utf-8 in PHP)
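Since the question also asks about detection, the same byte pattern can be reused with preg_match. This is only a sketch: contains4byte is a hypothetical name, not part of the answer above, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper: true if $string contains at least one well-formed
// 4-byte UTF-8 sequence. Byte-oriented, so the /u modifier is not needed.
function contains4byte(string $string): bool {
    return preg_match('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $string) === 1;
}

var_dump(contains4byte('d'));             // bool(false)
var_dump(contains4byte("d\u{20000}d"));   // bool(true): U+20000 needs 4 bytes
```

A check like this could run before the INSERT, so the application can decide whether to reject the input or clean it.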

edited Feb 9, 2018 at 13:39

answered May 11, 2013 at 11:53

cmbuckley


Community · Accepted Answer · 2014-10-08 13:44:04Z

17

This should work:

if (max(array_map('ord', str_split($string))) >= 240)

The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points are encoded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. In valid UTF-8, bytes that high appear nowhere else, so if the string contains any such byte, it's an indicator of a 4-byte sequence.
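To make the byte layout concrete, here is a small sketch that prints the encoded length and leading byte of sample characters of each width. The sample code points and the hasFourByteLead wrapper are illustrative choices, not part of the answer, and the "\u{...}" escapes assume PHP 7.0 or later.

```php
<?php
// Sample characters whose UTF-8 encodings are 1, 2, 3 and 4 bytes long.
$samples = [
    'U+0064'  => 'd',
    'U+00E9'  => "\u{E9}",
    'U+5527'  => "\u{5527}",
    'U+20000' => "\u{20000}",
];

foreach ($samples as $label => $char) {
    // ord() reads only the first byte of the string, i.e. the leading byte.
    printf("%-8s %d byte(s), leading byte %d\n", $label, strlen($char), ord($char));
}

// Hypothetical wrapper around the check above. The empty-string guard avoids
// calling max() on an empty array, which str_split('') yields on PHP 8.2+.
function hasFourByteLead(string $string): bool {
    return $string !== '' && max(array_map('ord', str_split($string))) >= 240;
}
```

Only the last sample has a leading byte of 240; the others come out as 100, 195 and 229.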

If you want to remove long characters, this will do:

$string = preg_replace_callback('/./u', function (array $match) {
    // Drop any character whose UTF-8 encoding is 4 bytes long.
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);

Though there may be a more elegant regex way to express high codepoints directly.
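One way to express the high code points directly is a character class covering U+10000 to U+10FFFF. This is a sketch rather than a tested drop-in: stripSupplementary is a hypothetical name, and it assumes valid UTF-8 input and a PCRE build with UTF-8 support, since the /u modifier is required for \x{...} code points above 0xFF.

```php
<?php
// Hypothetical helper: removes every code point above U+FFFF, which are
// exactly the code points that UTF-8 encodes in 4 bytes.
// Returns null if preg_replace fails, e.g. on malformed UTF-8 input.
function stripSupplementary(string $string): ?string {
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $string);
}

var_dump(stripSupplementary("d\u{20000}d"));   // string(2) "dd"
```

Because the match is by code point instead of byte length, no callback is needed.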

edited Oct 8, 2014 at 13:44

CommunityBot


answered May 11, 2013 at 11:45

deceze♦


1 Comment

Abby Chau Yu Hoi Over a year ago

Thanks for the detection, but can you finish it with a replacement example too? Given $a = "omg, I cannot insert 𡃁 into my table, blahblahblah"; the target is $a == "omg, I cannot insert MYTEXT into my table, blahblahblah";
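A replacement along the lines this comment asks for could adapt the callback from the answer so that it returns a placeholder instead of an empty string. This is a minimal sketch: replaceLongChars and its parameters are hypothetical names, and it assumes valid UTF-8 input and PCRE with UTF-8 support.

```php
<?php
// Hypothetical helper adapting the answer's callback: swaps each character
// whose UTF-8 encoding is 4 bytes long for $replacement.
// Returns null if the regex engine fails, e.g. on malformed UTF-8 input.
function replaceLongChars(string $string, string $replacement): ?string {
    return preg_replace_callback('/./u', function (array $match) use ($replacement) {
        return strlen($match[0]) >= 4 ? $replacement : $match[0];
    }, $string);
}

$a = "omg, I cannot insert 𡃁 into my table, blahblahblah";
var_dump(replaceLongChars($a, 'MYTEXT'));
```

Each 4-byte character is replaced individually, so two adjacent ones would produce the placeholder twice.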
