utf8_encode and mb_convert_encoding Conflicting Resuts [duplicate]

Question

I have a very annoying issue when transferring data from PHP (which is set up with default_charset = 'UTF-8') to MySQL, on an IIS environment with web.config set to:

<globalization requestEncoding="UTF-8" responseEncoding="UTF-8" />

Also tested on an Apache Environment with same default charset with same results. I also set MySQLi to set charset before any query:

$mysqli->set_charset('UTF-8');

I also have a set of prepared statements that search each character in a string to convert any none UTF-8 encoding to UTF-8 using:

mb_convert_encoding($char, 'UTF-8', mb_detect_encoding($char));

After trying and failing by parsing the whole string with the same process:

mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));

The content type of the file with the script is set to content-type charset UTF-8 on the header. And still none of this has fixed the issue.

I also, have set all collations of the MySQL tables and schemas to utf8mb4, and of course using `utf8mb4_unicode_520_ci'. Still, issue persists.

The function I wrote to split the string is as follows:

    static private function split_and_convert($value, $encoding = 'UTF-8') {
        $split = mb_str_split($value);
        $filter = [];
        foreach($split as $chr) {
            $from = mb_detect_encoding($chr);
            $encoded = ($from !== $encoding)
                ? mb_convert_encoding($chr, $encoding, $from)
                : $chr;
            $filter[] = $encoded;
            // echoing for testing
            echo 'char ' . $encoded . ' (' . $from . ')<br>';
        }
        return join('', $filter);
    }

Which echo the following trying to save Martínez:

char M (ASCII)
char a (ASCII)
char r (ASCII)
char t (ASCII)
char � (UTF-8)
char n (ASCII)
char e (ASCII)
char z (ASCII)

With MySQL throwing the following error:

Error No: 1366 - Incorrect string value: '\xEDnez' for column 'contactLast' at row 1

Still no joy. But if I do this, it works like a charm:

    static private function split_and_convert($value, $encoding = 'UTF-8') {
        $split = mb_str_split($value);
        $filter = [];
        foreach($split as $chr) {
            $from = mb_detect_encoding($chr);
            $encoded = ($from !== $encoding)
                ? mb_convert_encoding($chr, $encoding, $from)
                // set to actually encode the 'UTF-8' encoded char
                // I'm supposed not to do this, but is what works
                // which makes no sense
                : utf8_encode($chr);
            $filter[] = $encoded;
            // echoing for testing
            echo 'char ' . $encoded . ' (' . $from . ')<br>';
        }
        return join('', $filter);
    }

With the following results:

char M (ASCII)
char a (ASCII)
char r (ASCII)
char t (ASCII)
char í (UTF-8)
char n (ASCII)
char e (ASCII)
char z (ASCII)

And MySQL not throwing any error. If I replace the utf8_encode with mb_convert_encoding($encoded, 'UTF-8'), I get the same errors as not filtering the string.

I'm suspecting mb_detect_encoding is either detecting erroneously the encoding of the characters or simply the mb_convert_encoding is not doing it's job.

If anyone can help I will greatly appreciate it. I'm been searching for hours for a solution and none of the solutions offered have worked. And I find myself in a pickle because what works has been deprecated.

So, these are the list of the issues I'm facing:

mb_convert_encoding -> not encoding or encoding to something other than UTF8
mb_detect_encoding -> not detecting correctly
Strings is encoded as ASCII not UTF-8 as charset in header requested.
Confusion: Why I need to encode to UTF-8 an UTF-8 encoded character to work?

I don't undertand why you're using all those mb_convert_encoding() and mb_detect_encoding(). You shouldn't need to convert anything. — Olivier
– Olivier, Commented Nov 10, 2023 at 20:50
And beware that str_split() splits on bytes, not on chars. — Olivier
– Olivier, Commented Nov 10, 2023 at 21:04
utf8_encode() strictly converts from ISO-8859-1 to UTF-8. mb_convert_encoding() without the optional 3rd parameter specifying the source encoding will use the currently-configured default encoding from the PHP config. So if the former works, but not the latter, then it's likely that the configured default encoding doesn't match your input string, and you should be explicitly defining the input encoding. Additionally, text encoding cannot be reliably detected, and functions that purport to do so are guessing. Text encoding is metadata that you must be aware of at all times. — Sammitch
– Sammitch, Commented Nov 10, 2023 at 21:06
@Olivier that's why I'm asking the question. In theory, if a character or string is already UTF-8 should not need to be filtered. In theory. The reason for my frustration is that something is not detecting encoding properly. See the last list of issues. — raphie
– raphie, Commented Nov 10, 2023 at 21:25
@Sammitch that's a good point, the issue I'm having is that the detection is finding that the character is UTF-8 when apparently is not. I just applied those filters to test this theory and resulted to be true. Where I used the function utf8_decode should've not be needed at all. — raphie
– raphie, Commented Nov 10, 2023 at 21:29

Rick James · Accepted Answer · 2023-11-11 15:13:40Z

0

ED is the latin1 encoding for í. You should find what is encoding in latin1, then either change it to use UTF-8 (aka MySQL's CHARACTER SET = utf8mb4) or tell MySQL that your data is in CHARACTER SET latin1 and let it convert as it stores (INSERTs) and retrieves (SELECTs) it.

Do not use any conversion routines (mb_*), that tends to make the mess worse.

The UTF-8 (utf8mb4) encoding for acute-i is C3AD.

For more discussion, see "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored

answered Nov 11, 2023 at 15:13

Rick James

144k15 gold badges144 silver badges255 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

raphie Over a year ago

You are correct, but that the essence of the question, í was already detected as to be UTF-8 but when was sent as is to MySQL (utf8mb4) it threw the error, which make no sense, then if I filter the already detected UTF character through utf8_encode, the error was not thrown. The issue was not the mb_* functions but a middleware that was over processing the charset filtering, and the middleware was using utf8_encode regardless if it was already an UTF-8 charset.

raphie · Accepted Answer · 2023-11-13 20:26:20Z

After looking for answers and long day of research this was the issue. Over processing. At some point during development a middleware was put in place to verify each POST/GET request and process each value to ensure the encoding sent was UTF-8. The code I'm refactoring already took care of processing the data.

The middleware was coded using utf8_encode, so, I'm suspecting the function check itself to avoid over processing. I have to refactor the middleman as well so there's no conflict.

So, that was the issue, two different codes were processing the encoding creating the conflict. As soon I refactor the middleware the issue desisted.

In conclusion, the problem is not mb_convert_encoding or mb_detect_encoding, if you have a similar issue, make sure the code you are working on is not using two different methods doing similar processes with different functions.

In my case, the middleware was not verifying or set to detect if the string was already encoded to UTF-8, it was just implementing the encoding regardless. And the reason I've been paid to refactor the code.

Still what does not make sense to me is why a properly UTF-8 encoded string caused MySQL to throw an error. For that I suspect the over processing was corrupting the input.

Collectives™ on Stack Overflow

utf8_encode and mb_convert_encoding Conflicting Resuts [duplicate]

2 Answers 2

1 Comment

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Linked

Related