0

I'm importing emails from a Gmail account and storing them in a database. Apparently my PHP logic to convert everything to UTF-8 has failed with one particular email and when I run the query with that email I receive the following MariaDB error:

Incorrect string value: '\x96 conv...'

Here is the minimized version of the SQL query:

INSERT INTO mail (subject) VALUES ('orkut – convite...');

I eventually tracked this down to a header subject string. In my general MariaDB query log the string appears as intended with the en dash (): orkut – conv.... When PHP attempts to run the query I get the error Incorrect string value as mentioned above however when I manually run the exact same query via HeidiSQL the query executes just fine!

The header is split between the key and value and then the value is processed as follows:

$p = explode(': ', $header, 2);
$s = mb_convert_encoding(trim($p[1]), 'UTF-8', mb_detect_encoding($p[1]))

Now I've tried many things and the closest I could get was cmbuckly's answer on another thread however using $s = iconv('utf-8', 'windows-1252', $s); did not resolve the problem in this circumstance. I have heaps of other attempts like using utf8_encode($s) afterwards though I'm just at a loss at this point.

  • My MariaDB database has the proper utf8mb4 character set and utf8mb4_unicode_520_ci database collation.
  • I am not going to try to use MariaDB to encode/re-encode text, this is explicitly intended as a PHP question.

What is causing MariaDB to not work with the query via PHP though allows the query to work via HeidiSQL? Have I properly decoded the string?

1 Answer 1

1

https://www.fileformat.info/info/unicode/char/2013/charset_support.htm

The only time an en-dash has a 0x96 byte in it is when a windows cp125x encoding is used.

Figure out which one your source data uses and convert it to UTF8 with

$utf8_data = mb_convert_encoding($data, 'UTF-8', 'cp125x');

Additionally: UTF-8 all the way through

Sign up to request clarification or add additional context in comments.

1 Comment

This directly answers the question perfectly - thank you. The overall issue with no doubt likely extends to numerous other characters that when in a string causes PHP to fail to determine the proper character encoding so after much work I've put together a thread to help identify more troublesome characters so we can avoid data corruption: stackoverflow.com/questions/70045981/….

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.