PHP Multi Byte str_replace?

Question

I'm trying to do accented character replacement in PHP but get funky results. My guess is that this happens because I'm using a UTF-8 string and str_replace() can't properly handle multi-byte strings.

$accents_search     = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'); 

$accents_replace    = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n'); 

$str = str_replace($accents_search, $accents_replace, $str);

Results I get:

Ørjan Nilsen -> �orjan Nilsen

Expected Result:

Ørjan Nilsen -> Orjan Nilsen

Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?

dav · Accepted Answer · 2013-04-10 16:44:46Z

26

According to the PHP documentation, the str_replace() function is binary-safe, which means that it can handle UTF-8 encoded text without any data loss.
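A minimal sketch of the point above, assuming every string involved is valid UTF-8. The bytes are written as escape sequences so the result does not depend on how the script file itself is saved; the variable names are illustrative only.

```php
<?php
// "\xC3\x98" is the UTF-8 encoding of "Ø" (U+00D8), "\xC3\xA9" that of "é" (U+00E9).
$search  = array("\xC3\x98", "\xC3\xA9");
$replace = array('O', 'e');

$subject = "\xC3\x98rjan Nilsen";   // "Ørjan Nilsen" in UTF-8

// str_replace() compares raw bytes. In valid UTF-8 a complete character
// sequence can only match at a character boundary, so nothing is cut in half.
$result = str_replace($search, $replace, $subject);

echo $result, "\n";   // Orjan Nilsen
```

If this works but the original script does not, the search strings or the input are probably not in the same encoding.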

edited Apr 10, 2013 at 16:44

answered Oct 24, 2012 at 14:20


3 Comments

user1854856 Over a year ago

Thank you, dav. This should be the correct answer, because it explains why there are mb_substr() and mb_strlen() but no mb_str_replace(). The first two functions take (or return) character offsets, which depend on the text encoding, while str_replace() does not. That's why str_replace() can work safely with UTF-8 data (or, more generally, with binary data).

Vasiliy Zverev Over a year ago

For this to work, the PHP file containing the $accents_search strings must itself be saved as UTF-8, so that all parameters to str_replace() are UTF-8. More info on PHP string literal encoding

WoodrowShigeru Over a year ago

Neither the answer nor the link explains it. But your comment does, Stan. Thank you.
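To illustrate the offset argument from the comments, here is a small sketch, assuming the mbstring extension is loaded. It contrasts the byte-oriented functions with their mb_ counterparts; the variable names are made up for the example.

```php
<?php
// Assumes the mbstring extension is loaded.
$s = "\xC3\x98rjan";   // "Ørjan" in UTF-8: 5 characters, 6 bytes

$byteLength = strlen($s);               // counts bytes
$charLength = mb_strlen($s, 'UTF-8');   // counts characters

// A byte offset can land inside a multi-byte character...
$brokenFirst = substr($s, 0, 1);             // "\xC3", only half of "Ø"
// ...while mb_substr() counts in characters.
$wholeFirst  = mb_substr($s, 0, 1, 'UTF-8'); // "Ø"

// str_replace() takes no offsets or lengths, so there is nothing to miscount.
$replaced = str_replace("\xC3\x98", 'O', $s);

echo $byteLength, ' ', $charLength, ' ', $replaced, "\n";   // 6 5 Orjan
```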

phsiao · Accepted Answer · 2009-09-20 14:45:07Z

6

It looks like the string was not replaced correctly because your input encoding and the encoding of the script file do not match.

answered Sep 20, 2009 at 14:45


3 Comments

OIS Over a year ago

Aye, a UTF-8 file run on the CLI with output sent to a text file (don't output to an ISO terminal) works.

Ian Over a year ago

So how can I change my Input encoding then?

phsiao Over a year ago

If you do a $str = "Ørjan Nilsen" at the beginning, and print $str out at the end, does it give you the right answer? If you read from cli to initialize $str then it may not be set with proper encoding.
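One possible way to deal with a mismatch is to validate the input and convert it before replacing. The sketch below assumes the mbstring extension is loaded and that the non-UTF-8 input is ISO-8859-1; that source encoding is an assumption for illustration, and the real one has to be known or determined separately.

```php
<?php
// Assumes mbstring is loaded and that the incoming data is ISO-8859-1.
$search  = array("\xC3\x98");   // "Ø" as UTF-8, as a UTF-8 source file would contain it
$replace = array('O');

$latin1Input = "\xD8rjan Nilsen";   // "Ørjan Nilsen" in ISO-8859-1: "Ø" is the single byte 0xD8

// Mismatch: the UTF-8 needle never occurs in the ISO-8859-1 bytes.
$unchanged = str_replace($search, $replace, $latin1Input);

// Convert to UTF-8 first when the input is not valid UTF-8.
$input = $latin1Input;
if (!mb_check_encoding($input, 'UTF-8')) {
    $input = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}
$fixed = str_replace($search, $replace, $input);

echo $fixed, "\n";   // Orjan Nilsen
```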

mermshaus · Accepted Answer · 2024-01-11 13:27:11Z

4

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.

NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).

$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));

$test = Normalizer::normalize($test, Normalizer::FORM_D);

// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';

$result = preg_replace($pattern, '', $test);

// Re-encode in NFC (we assume that "UTF-8" more or less means "UTF-8 in NFC").
// (I'm not 100 % sure this is necessary. But it won't do any harm.)
$resultNfc = Normalizer::normalize($result, Normalizer::FORM_C);

var_dump($resultNfc);
    // string(55) "aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn"

The Normalizer class is part of the intl extension (originally a PECL package, bundled with PHP since 5.3). (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
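As the output above shows, characters such as ª, œ, º, ø and Ø have no canonical decomposition and survive NFD unchanged. A possible variation, sketched here under the assumption that the intl extension is loaded and PHP 7 or later is used: remove only combining marks (\p{Mn}) so that digits and punctuation are kept, then map the leftovers explicitly. The function name strip_accents() and the $fallback table are hypothetical, not part of any library.

```php
<?php
// Assumes the intl extension (Normalizer) and PHP 7+ ("\u{...}" escapes).
// strip_accents() and $fallback are hypothetical names used for this sketch.
function strip_accents(string $text): string
{
    // Characters without a canonical decomposition need an explicit mapping.
    $fallback = array(
        "\u{00D8}" => 'O',   // Ø
        "\u{00F8}" => 'o',   // ø
        "\u{0153}" => 'oe',  // œ
        "\u{00AA}" => 'a',   // ª
        "\u{00BA}" => 'o',   // º
    );

    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text;   // normalization failed (e.g. invalid UTF-8); leave input untouched
    }

    // \p{Mn} matches non-spacing combining marks only, so digits,
    // punctuation and whitespace are preserved.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $decomposed;

    return strtr($stripped, $fallback);
}

echo strip_accents("\u{00D8}rjan Nilsen"), "\n";      // Orjan Nilsen
echo strip_accents("\u{00E9}t\u{00E9} 2024!"), "\n";  // ete 2024!
```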

edited Jan 11, 2024 at 13:27

answered Nov 18, 2009 at 22:30


3 Comments

Ian Over a year ago

Thanks, that's actually pretty useful. Though I don't really want to use that in this instance because it results in the loss of accents.

mermshaus Over a year ago

I thought that getting rid of accents was what you were trying to do?

mickmackusa Over a year ago

The goal is to replace accented characters with their equivalent non-accented character.

Gumbo · Accepted Answer · 2011-06-14 14:14:59Z

2

Try this function definition:

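if (!function_exists('mb_str_replace')) {
    // Regex-based stand-in for str_replace(); the /u modifier makes PCRE treat
    // pattern and subject as UTF-8. Closures are used because create_function()
    // was removed in PHP 8.0.
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                // Pass $search through unchanged; casting an array to string would yield "Array".
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Quote each search string as a whole and join them into one alternation.
        $quote = function ($item) { return preg_quote((string)$item, '/'); };
        $pattern = '/(?:'.implode('|', array_map($quote, (array)$search)).')/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            // Look up the replacement for each match; searches without a partner stay unchanged.
            $f = function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            };
            return preg_replace_callback($pattern, $f, $subject);
        }
        // Use a callback so that "$1" or "\1" inside $replace is inserted literally.
        $replace = (string)$replace;
        return preg_replace_callback($pattern, function () use ($replace) { return $replace; }, $subject);
    }
}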

edited Jun 14, 2011 at 14:14

answered Sep 20, 2009 at 15:01


1 Comment

Igor Over a year ago

Maybe I'm mistaken, but it seems the correct pattern would be: '/('.preg_quote(implode('', (array)$search), '/').')/u'?
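On the pattern question: quoting the concatenated search strings would match only that concatenation as one literal. A minimal sketch of the alternative, quoting each search string separately and joining with "|", is given below. replace_single_pass() is a hypothetical helper, not a PHP built-in, and it assumes the subject is valid UTF-8. Note that a regex alternation picks the first alternative that matches, not the longest, so longer search strings should be listed first.

```php
<?php
// replace_single_pass() is a hypothetical helper written for this sketch.
function replace_single_pass(array $map, string $subject): string
{
    if (!$map) {
        return $subject;   // nothing to replace
    }

    // Quote every search string on its own, then join the results with "|".
    $alternatives = array();
    foreach (array_keys($map) as $needle) {
        $alternatives[] = preg_quote((string)$needle, '/');
    }
    $pattern = '/(?:' . implode('|', $alternatives) . ')/u';

    // Each match is replaced exactly once; replacements are never re-scanned.
    $result = preg_replace_callback($pattern, function ($m) use ($map) {
        return $map[$m[0]];
    }, $subject);

    return $result ?? $subject;   // preg_replace_callback() returns null on error
}

$map = array("\xC3\x98" => 'O', 'a' => 'b', 'b' => 'c');
$text = "\xC3\x98rjan ab";   // "Ørjan ab" in UTF-8

echo replace_single_pass($map, $text), "\n";                       // Orjbn bc
echo str_replace(array_keys($map), array_values($map), $text), "\n"; // Orjcn cc
```

The second line of output differs because str_replace() applies each search/replace pair in turn to the whole string, so a "b" produced by the a-to-b step is replaced again by the b-to-c step.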
