
Here is the problem, in a working unit test. I think it's either a bug in DBI + DBD::mysql, with respect to how it handles MySQL JSON columns, or a bug in my brain.

use strict;
use warnings;

use utf8;

use Test2::V0;
use Test2::Plugin::UTF8;
use Test2::Plugin::NoWarnings echo => 1;

use DBI;
use DBD::mysql 4.041;  # 4.041+ required for utf8mb4
use JSON 4.01 qw//;
use Encode;

#
# setup $dbh, create test table
#
my $dbname = '';
my $host = 'localhost';
my $user = '';
my $pass = '';

my $dbh = DBI->connect(
  "DBI:mysql:" . ($dbname || '') . ";host=" . $host,
  $user, 
  $pass || undef, 
  { RaiseError => 1, PrintError => 0, AutoCommit=> 1 }
);

$dbh->{'mysql_enable_utf8mb4'} = 1;
$dbh->{'charset'} = 'utf8';

$dbh->do(
  "CREATE TABLE IF NOT EXISTS `test` ("
  . "id int unsigned, "
  . "`my_json` json NOT NULL, "
  . "`my_text` mediumtext NOT NULL "
  . ") ENGINE=InnoDB "
  . "DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci"
);

#
# create and insert test data
#

# A use case for spam! Got this junk from spam inbox
my $utf8str = "ion?• été e😍💋🔥ulière vs• Ch• 😊⭐👉🏻🔞🍆Sho是ab 期待您x";  
my $hash = { test => $utf8str };
my $json = JSON->new->encode( $hash );

my $id = time;
$dbh->do("INSERT INTO test SET id=?, my_json=?, my_text=?", undef, $id, $json, $json);

#
# retrieve test data and check it
#
my ( $my_json, $my_text ) = $dbh->selectrow_array("SELECT my_json, my_text FROM test WHERE id=$id");

is( $my_text, $json );  # ok

is( $my_json, $json );  # fails. got {"test": "ion?\nâ\N{U+80}¢ ét .... 

is( decode('UTF-8', $my_json), $json );  # ok'ish. mysql adds a space between "test":"..." but value looks ok

#
# another test, independent of JSON encoder, using hand-built json
#
$id++;
$json = '{"test":"' . $utf8str . '"}';
$dbh->do("INSERT INTO test SET id=?, my_json=?, my_text=?", undef, $id, $json, $json);

( $my_json, $my_text ) = $dbh->selectrow_array("SELECT my_json, my_text FROM test WHERE id=$id");

is( $my_text, $json );  # ok

is( $my_json, $json );  # fails. got {"test": "ion?\nâ\N{U+80}¢ ét .... 

printf "%vX", $my_json;  # 7B.22.74.65.73.74.22.3A.20.22.69.6F.6E.3F.E2.80.A2.20.C3.A9.74.C3.A9.20.65.F0.9F.98.8D.F0.9F.92.8B.F0.9F.94.A5.75.6C.69.C3.A8.72.65.20.76.73.E2.80.A2.20.43.68.E2.80.A2.20.F0.9F.98.8A.E2.AD.90.F0.9F.91.89.F0.9F.8F.BB.F0.9F.94.9E.F0.9F.8D.86.53.68.6F.E6.98.AF.61.62.20.E6.9C.9F.E5.BE.85.E6.82.A8.78.22.7D

printf "%vX", $json;  # 7B.22.74.65.73.74.22.3A.22.69.6F.6E.3F.2022.20.E9.74.E9.20.65.1F60D.1F48B.1F525.75.6C.69.E8.72.65.20.76.73.2022.20.43.68.2022.20.1F60A.2B50.1F449.1F3FB.1F51E.1F346.53.68.6F.662F.61.62.20.671F.5F85.60A8.78.22.7D

is( decode('UTF-8', $my_json), $json );  # ok'ish. mysql adds a space between "test":"..." but value looks ok

#
# cleanup
#
$dbh->do("DROP TABLE `test`");

$dbh->disconnect();

done_testing();

My understanding is that the JSON standard requires UTF-8. In addition, MySQL also requires/uses UTF-8 with regard to JSON columns, as described here:

MySQL handles strings used in JSON context using the utf8mb4 character set and utf8mb4_bin collation. Strings in other character sets are converted to utf8mb4 as necessary. (https://dev.mysql.com/doc/refman/8.0/en/json.html)

My understanding is also that DBI handles the UTF-8 encoding/decoding, and should be returning decoded UTF-8 as it is doing for the mediumtext column, as stated here:

This attribute determines whether DBD::mysql should assume strings stored in the database are utf8. This feature defaults to off.

When set, a data retrieved from a textual column type (char, varchar, etc) will have the UTF-8 flag turned on if necessary. This enables character semantics on that string. https://metacpan.org/pod/DBD::mysql#mysql_enable_utf8

However, it appears not to be doing so for JSON columns: explicit decoding appears to be required after retrieving data from a JSON column.

So which is it... bug in DBI/DBD::mysql, or bug in my brain?

EDIT: Good news, it's not my brain. Bad news, appears to be a known bug. https://github.com/perl5-dbi/DBD-mysql/issues/309

So, the answer I'm seeking now is a backward-compatible workaround, i.e., a workaround that won't break if/when DBD::mysql is fixed. Double-decoding would not be good.
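To make the double-decoding risk concrete (a small illustration with made-up values, separate from the test above): if DBD::mysql is later fixed and starts returning already-decoded character strings, blindly calling decode('UTF-8', ...) a second time either mangles the data or dies outright.

use Encode qw/decode/;

# a character string, as a fixed DBD::mysql would presumably return it
my $already_decoded = "\x{E9}t\x{E9}";   # "été"

# 0xE9 followed by "t" is not valid UTF-8, so the second decode silently
# replaces the bytes with U+FFFD (the Unicode replacement character)
my $mangled = decode('UTF-8', $already_decoded);

# strings containing code points above 0xFF (e.g. the emoji above) make
# decode('UTF-8', ...) die with "Cannot decode string with wide characters"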

5 Comments

  • JSON->new->encode expects strings of code points, but you provided a string encoded using UTF-8 instead. You need use utf8; to tell Perl your source code is encoded using UTF-8 so that it can decode it properly. There may be other problems. Commented Jan 15, 2022 at 19:47
  • decode('UTF-8', $my_json) makes no sense. $dbh->{'mysql_enable_utf8mb4'} = 1; should have decoded it already. Commented Jan 15, 2022 at 19:48
  • It would help if you print the various variables using printf "%vX". Commented Jan 15, 2022 at 19:50
  • $dbh->{'mysql_enable_utf8mb4'} = 1 doesn't appear to work for JSON columns. After posting, I found this: github.com/perl5-dbi/DBD-mysql/issues/309, so now I'm wondering how to fix this issue in a way that is backward-compatible if/when DBD::mysql gets fixed. Commented Jan 15, 2022 at 19:56
  • added use utf8; and printf "%vX" as suggested, thx! Commented Jan 15, 2022 at 20:04

3 Answers


So, the answer I'm seeking now is a backward-compatible workaround, i.e., a workaround that won't break if/when DBD::mysql is fixed. Double-decoding would not be good.

You could try to determine if the JSON decode bug is present by creating a test table where you insert a non-ascii character that has a known UTF-8 encoding with byte length greater than one. For example:

$dbh->do("DROP TABLE IF EXISTS json_decode_test");
$dbh->do("CREATE TABLE json_decode_test (id int unsigned, `my_json` json NOT NULL)");
my $unicode_str = "是";  # This character will always have a UTF-8 encoding with
                         # byte length > 1
my $hash = { test_str => $unicode_str };
my $json = JSON->new;
my $json_str = $json->encode( $hash );
my $id = time;
my $attrs = undef;
$dbh->do("INSERT INTO json_decode_test SET id=?, my_json=?", $attrs, $id, $json_str);
my ( $json_str2 ) = $dbh->selectrow_array(
    "SELECT my_json FROM json_decode_test WHERE id=$id");
my $hash2 = $json->decode( $json_str2 );
my $unicode_str2 = $hash2->{test_str};
# If the json unicode bug is present, $unicode_str2 will not be decoded. Instead
#  it will be a string of length 3 representing the UTF-8 encoding of $unicode_str
#  (printf "%vX\n", $unicode_str2) gives output : E6.98.AF
my $json_unicode_bug = (length $unicode_str2) > 1;
if ( $json_unicode_bug ) {
    say "unicode bug is present..";
    # need to run decode_utf8() on every returned json object from DBI
}
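Given that flag, one possible shape for the on-the-fly part (just a sketch; fetch_json is a hypothetical helper, not something provided by DBI or DBD::mysql) is to decode conditionally, so the same code keeps working if/when DBD::mysql is fixed:

use Encode qw/decode/;
use JSON qw//;

# Hypothetical helper: fetch a single JSON column and return it as a Perl
# data structure. It assumes the $json_unicode_bug flag from the probe above
# is in scope, and decodes the raw bytes only when the bug was detected.
sub fetch_json {
    my ( $dbh, $sql, @bind ) = @_;
    my ( $raw ) = $dbh->selectrow_array( $sql, undef, @bind );
    $raw = decode( 'UTF-8', $raw ) if $json_unicode_bug;
    return JSON->new->decode( $raw );
}

# e.g.
# my $data = fetch_json( $dbh, "SELECT my_json FROM json_decode_test WHERE id=?", $id );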

4 Comments

Something like this will work as a unit test, thx!! I was hoping for an on-the-fly workaround. I'm thinking about using utf8::is_utf8 to detect whether the string has already been decoded.
You are welcome. I do not think utf8::is_utf8() will work here, see this answer
Yep, I agree utf8::is_utf8 should not be used in general. The question I was considering was whether it would work reliably for THIS specific case of retrieving an undecoded JSON column from MySQL using the present buggy version of DBD::mysql. I can't convince myself that it will, because there's no way to know reliably how utf8::is_utf8() would act in a fixed DBD::mysql, i.e. a DBD::mysql that doesn't yet exist!
The complete and best available solution appears to be 1. when retrieving, decode( 'UTF-8', $my_json ) prior to json-decoding, and 2. add a unit test (as you suggested) to confirm bug is acting as predicted, in case DBD::mysql is fixed at a later date. Does that seem right?

Change the SQL statement that creates the test table to the following:

my $query = "
CREATE TABLE IF NOT EXISTS `test` (
    `id`       INT UNSIGNED,
    `my_json`  JSON NOT NULL,
    `my_text`  MEDIUMTEXT NOT NULL
) ENGINE=InnoDB DEFAULT
  CHARSET=utf8mb4
  COLLATE=utf8mb4_unicode_ci
";

$dbh->do($query);

2 Comments

Doesn't work. Still getting undecoded my_json
To be clear, I added ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci to the test and re-ran, got same result. For consistency I added the above to the original question. Thx for suggesting

F0.9F.98.8D (seen in your %vX dump of $my_json) -- the UTF-8 encoding of the 😍 character -- this is good
1F60D (seen in your %vX dump of $json) -- a Unicode codepoint -- not useful in `MEDIUMTEXT`

Do

SELECT HEX(my_json) FROM test WHERE id = ...

to see what is inside MySQL; it should be F09F988D.
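The same check can be run from the test script itself (a sketch that reuses the $dbh and $id already defined in the question's code):

# HEX() is computed server-side, so this shows the bytes MySQL actually
# stored, independent of any client-side decoding.
my ( $hex ) = $dbh->selectrow_array(
    "SELECT HEX(my_json) FROM test WHERE id=?", undef, $id );
print "$hex\n";   # should contain F09F988D, the UTF-8 bytes of the 😍 emoji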

You should not be encoding or decoding any string that is written to or read from MySQL.
