7

I cannot solve this issue and I'm getting crazy.

JSON_encode() is casting the error: Malformed UTF-8 characters, possibly incorrectly encoded on few records (2 or 3) from a set of 10k records. However this seems very impossible to fix.

  • mysql is already utf8mb4 everywhere (database, table, columns and collation)
  • php is 7.2 and of course in utf8
  • apache default charset is utf8 (however the error is throw at PHP-level).

I can also print to screen correctly the record in PHP without issue in a simple HTML debug page. However If I try to encode it in JSON I get the error.

I found that these records have been imported from a CVS probably bypassing the cleaner. What is so strange is that the entire CSV file is parsed with:

$this->encoding = mb_detect_encoding($source,mb_detect_order(),true);
if ($this->encoding!="" && $this->encoding!="UTF8") {
    $source = iconv($this->encoding, "UTF-8", $source);
} 

I cannot post any full broken data due to the privacy (and GDPR). However I succeed to extract a part which seems to be the broken one:

RESIDENCE �PRINCIPE

UPDATES

I try to get the bitcode of these broken chars. This is what I found. In ASCII by using simple native function str_split and ord these char is:

'�' 160

I would like to find the bitcode also in utf8, so I find this usefull function on PHP.net http://php.net/manual/en/function.ord.php#109812 Which try to find bitcode of MultiByteStrings. and it gives me:

-2096

Which is....... negative?

11
  • 2
    Were you going to show us an example of the "broken" data? Commented May 30, 2018 at 18:10
  • My unique example of broken data are wirebank transfers records, i cannot post these details on stackoverflown. Sorry (GDPR you know). Commented May 31, 2018 at 8:07
  • Can anybody let me know why people should downvote a question like this? Because really it seems to be just for badness. Commented May 31, 2018 at 8:09
  • You're getting the data from the database? Have you set the database connection encoding to utf8mb4? Note: not the column collation/charset, the connection encoding. Commented May 31, 2018 at 8:19
  • As I wrote: mysql is already utf8mb4 everywhere (database, table, columns and collation) Commented May 31, 2018 at 8:28

2 Answers 2

6

SOLVED!

The issue was in the function mb_detect_order(), this function just don't work as I was expecting. I was thinking this was a list of full supporting encoding order by mostly used in order to speed up the detection process.

But I just found that this function return just 2 encoding:

//print_r(mb_detect_order());
Array
(
    [0] => ASCII
    [1] => UTF-8
)

Which is almost completly useless in my case. MB functions can detect much more charset. You can check them out by run mb_list_encodings() and get the full list:

//print_r(mb_list_encodings());
Array
(
    [0] => pass
    [1] => auto
    [2] => wchar
    [3] => byte2be
    [4] => byte2le
    [5] => byte4be
    [6] => byte4le
    [7] => BASE64
    [8] => UUENCODE
    [9] => HTML-ENTITIES
    [10] => Quoted-Printable
    [11] => 7bit
    [12] => 8bit
    [13] => UCS-4
    [14] => UCS-4BE
    [15] => UCS-4LE
    [16] => UCS-2
    [17] => UCS-2BE
    [18] => UCS-2LE
    [19] => UTF-32
    [20] => UTF-32BE
    [21] => UTF-32LE
    [22] => UTF-16
    [23] => UTF-16BE
    [24] => UTF-16LE
    [25] => UTF-8
    [26] => UTF-7
    [27] => UTF7-IMAP
    [28] => ASCII
    [29] => EUC-JP
    [30] => SJIS
    [31] => eucJP-win
    [32] => EUC-JP-2004
    [33] => SJIS-win
    [34] => SJIS-Mobile#DOCOMO
    [35] => SJIS-Mobile#KDDI
    [36] => SJIS-Mobile#SOFTBANK
    [37] => SJIS-mac
    [38] => SJIS-2004
    [39] => UTF-8-Mobile#DOCOMO
    [40] => UTF-8-Mobile#KDDI-A
    [41] => UTF-8-Mobile#KDDI-B
    [42] => UTF-8-Mobile#SOFTBANK
    [43] => CP932
    [44] => CP51932
    [45] => JIS
    [46] => ISO-2022-JP
    [47] => ISO-2022-JP-MS
    [48] => GB18030
    [49] => Windows-1252
    [50] => Windows-1254
    [51] => ISO-8859-1
    [52] => ISO-8859-2
    [53] => ISO-8859-3
    [54] => ISO-8859-4
    [55] => ISO-8859-5
    [56] => ISO-8859-6
    [57] => ISO-8859-7
    [58] => ISO-8859-8
    [59] => ISO-8859-9
    [60] => ISO-8859-10
    [61] => ISO-8859-13
    [62] => ISO-8859-14
    [63] => ISO-8859-15
    [64] => ISO-8859-16
    [65] => EUC-CN
    [66] => CP936
    [67] => HZ
    [68] => EUC-TW
    [69] => BIG-5
    [70] => CP950
    [71] => EUC-KR
    [72] => UHC
    [73] => ISO-2022-KR
    [74] => Windows-1251
    [75] => CP866
    [76] => KOI8-R
    [77] => KOI8-U
    [78] => ArmSCII-8
    [79] => CP850
    [80] => JIS-ms
    [81] => ISO-2022-JP-2004
    [82] => ISO-2022-JP-MOBILE#KDDI
    [83] => CP50220
    [84] => CP50220raw
    [85] => CP50221
    [86] => CP50222
)

I was in wrong, thinking that mb_detect_order was just an ordered version of this list. The mb_detect_order is just.... useless. In order to encode in UTF8 in the right way use the following code:

$my_encoding_list = [
    "UTF-8",
    "UTF-7",
    "UTF-16",
    "UTF-32",
    "ISO-8859-16",
    "ISO-8859-15",
    "ISO-8859-10",
    "ISO-8859-1",
    "Windows-1254",
    "Windows-1252",
    "Windows-1251",
    "ASCII",
    //add yours preferred
];

//remove unsupported encodings
$encoding_list = array_intersect($my_encoding_list, mb_list_encodings());

//detect 'finally' the encoding
$this->encoding = mb_detect_encoding($source,$encoding_list,true);

This worked and solved my issue with bad data saved in the database.

Sign up to request clarification or add additional context in comments.

Comments

1

You can filter these unknown characters by using the UTF-8//IGNORE charset in your iconv method.

$this->encoding = mb_detect_encoding($source,mb_detect_order(),true);

if ($this->encoding!="" && $this->encoding!="UTF8") {
    $source = iconv($this->encoding, "UTF-8//IGNORE", $source);
} 

By using the //IGNORE after your charset, every characters that cannot be represented in the target charset will be silently discarded.

1 Comment

This is a good workaround, however I found probably the reason

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.