
I'm trying to do accented character replacement in PHP but get funky results. My guess is that this happens because I'm using a UTF-8 string and str_replace() can't properly handle multi-byte strings.

$accents_search     = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'); 

$accents_replace    = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n'); 

$str = str_replace($accents_search, $accents_replace, $str);

Results I get:

Ørjan Nilsen -> �orjan Nilsen

Expected Result:

Ørjan Nilsen -> Orjan Nilsen

Edit: I've got my internal encoding set to UTF-8 (according to mb_internal_encoding()), and the value of $str is also UTF-8, so as far as I can tell, all the strings involved are UTF-8. Does str_replace() detect character sets and use them properly?


4 Answers

26

According to the PHP documentation, the str_replace() function is binary-safe, which means that it can handle UTF-8 encoded text without any data loss.
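A minimal sketch of that byte-wise behaviour, assuming PHP 7.0 or later so that the "\u{...}" escape syntax is available. The escapes are used only so that the result does not depend on how this file itself is saved; the variable names are illustrative.

```php
<?php
// "\u{D8}" is Ø and "\u{F8}" is ø, both written as UTF-8 by the escape syntax.
$search  = ["\u{D8}", "\u{F8}"];
$replace = ['O', 'o'];

// The subject is UTF-8 as well, so needle and haystack bytes line up.
$str = "\u{D8}rjan Nilsen";

// str_replace() compares raw bytes; it neither knows nor cares about encodings.
$out = str_replace($search, $replace, $str);

echo $out, "\n"; // Orjan Nilsen
```

Because a valid UTF-8 sequence can only match at a character boundary inside another valid UTF-8 string, the byte-wise comparison is safe as long as every argument really is UTF-8.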


3 Comments

Thank you, dav. This should be the correct answer, because it explains why there are mb_substr() and mb_strlen() but no mb_str_replace(). The first two functions use (or return) offset positions for text characters (which depend on the text encoding), while str_replace() does not. That's why str_replace() can work with UTF-8 data safely (or any other Unicode encoding, or binary data in general).
For this to work, the PHP file containing the $accents_search strings must be saved as UTF-8, so that all parameters to str_replace() are UTF-8. More info on PHP string literal encoding
Neither the answer nor the link explains it. But your comment does, Stan. Thank you.
6

It looks like the string was not replaced because your input encoding and the file encoding do not match.
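One way to check for such a mismatch is sketched below. It assumes the mbstring extension is loaded and PHP 7.0 or later, and it assumes (purely for illustration) that any input that is not valid UTF-8 is ISO-8859-1; the sample input is hypothetical.

```php
<?php
// Hypothetical input: "Ørjan Nilsen" as ISO-8859-1, where Ø is the single byte 0xD8.
$input = "\xD8rjan Nilsen";

// 0xD8 followed by "r" is not a valid UTF-8 sequence, so this check fails.
$wasUtf8 = mb_check_encoding($input, 'UTF-8');

if (!$wasUtf8) {
    // Assumption: the non-UTF-8 input is ISO-8859-1. Adjust to the real source encoding.
    $input = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
}

// The search string is UTF-8 (via the escape), and now the subject is too.
$out = str_replace("\u{D8}", 'O', $input);

echo $out, "\n"; // Orjan Nilsen
```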

3 Comments

Aye, a UTF-8 file run on the CLI with its output written to a text file (not printed to an ISO terminal) works.
So how can I change my input encoding then?
If you do a $str = "Ørjan Nilsen" at the beginning, and print $str out at the end, does it give you the right answer? If you read from the CLI to initialize $str, it may not have the proper encoding.
4

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.

NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).

$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));

$test = Normalizer::normalize($test, Normalizer::FORM_D);

// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';

$result = preg_replace($pattern, '', $test);

// Re-encode in NFC (we assume that "UTF-8" more or less means "UTF-8 in NFC").
// (I'm not 100 % sure this is necessary. But it won't do any harm.)
$resultNfc = Normalizer::normalize($result, Normalizer::FORM_C);

var_dump($resultNfc);
    // string(55) "aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn"

The Normalizer class is part of the intl extension (available through PECL and bundled with PHP since 5.3.0). (The algorithm itself isn't very complicated, but as far as I know it needs to load a lot of character mappings. I wrote a PHP implementation a while ago.)
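To make the decomposition step concrete, here is a small sketch, assuming the intl extension is loaded and PHP 7.0 or later. The helper name strip_combining_marks is hypothetical, and it uses a narrower pattern than the code above: it removes only nonspacing marks (\p{Mn}), so digits and punctuation survive.

```php
<?php
// "ü" (U+00FC) decomposes into "u" plus COMBINING DIAERESIS (U+0308) under NFD.
$nfd = Normalizer::normalize("\u{FC}", Normalizer::FORM_D);
echo bin2hex($nfd), "\n"; // 75cc88

// Hypothetical helper: decompose, then drop only the combining marks.
function strip_combining_marks($text)
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_combining_marks("\u{E9}t\u{E9} 2010!"), "\n"; // ete 2010!

// Ø has no canonical decomposition, so this approach leaves it untouched.
$unchanged = strip_combining_marks("\u{D8}rjan") === "\u{D8}rjan";
```

Letters such as Ø, ø and œ are not base letters plus a mark in Unicode, which is why they also remain in the var_dump() output above; they still need an explicit mapping.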

3 Comments

Thanks, that's actually pretty useful. Though I don't really want to use that in this instance because it results in the loss of accents.
I thought that getting rid of accents was what you were trying to do?
The goal is to replace accented characters with their equivalent non-accented character.
2

Try this function definition:

if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        // Apply the replacement to each element when given an array of subjects.
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Quote each search term in full (not just its first byte) and build one alternation.
        // Closures replace create_function(), which was removed in PHP 8.0.
        $quoted = array_map(function ($term) {
            return preg_quote($term, '/');
        }, (array)$search);
        $pattern = '/(?:'.implode('|', $quoted).')/u';
        if (is_array($search) && is_array($replace)) {
            // Pair each search term with its replacement and look matches up in that table.
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            return preg_replace_callback($pattern, function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            }, $subject);
        }
        // Single replacement string for all search terms.
        return preg_replace($pattern, (string)$replace, $subject);
    }
}

1 Comment

Maybe I'm mistaken, but it seems the correct pattern would be: '/('.preg_quote(implode('', (array)$search), '/').')/u'?
