2

More than three years ago I posted a similar question related to not being able to insert data into a table because of what I believe is an incorrect character set, and I'm having the same problem with the same script with different data today.

The issue is with German characters from Ilpo J�rvinen ("a" with umlaut?).

execute failed: Incorrect string value: '\xD6sterl...' with mariadb and perl DBD

DBD::mysql::st execute failed: Incorrect string value: '\xE4rvine...' for column `lsv6`.`xyz_content`.`fulltext` at row 1 at myscript.pl line 467.

I'm trying to process emails with different charsets, but apparently this one is not being encoded by my script properly. From the email:

Content-type: text/plain; charset="iso-8859-1"
Content-transfer-encoding: quoted-printable

The full script is too long to post here, but I believe these are the relevant parts where the database is involved.

sub db_connect($) {
    ...
    return DBI->connect("DBI:mysql:database=$DB{'db'};host=$DB{'host'}", $DB{'user'}, $DB{'pass'}, { PrintError => 1, RaiseError => 1, mysql_enable_utf8mb4 => 1 } )
    ...
}

$fullText               = $dbh->quote($fullText);
my $sql                 = <<EOF;
INSERT INTO xuxgc_content (title, alias, introtext, `fulltext`, state, catid, created, created_by, created_by_alias, modified, modified_by, checked_out, checked_out_time, publish_up, publish_down, images, urls, attribs, version, ordering, metakey, metadesc, metadata, access, hits, language)
VALUES ($title, "$title_alias", $introText, $fullText, $state, $catid, $created, $created_by, $created_by_alias, $modified, $modified_by, $checked_out, $checked_out_time, $publish_up, $publish_down, $images, $urls, $attribs, $version, $ordering, $metakey, $metadesc, $metadata, $access, $hits, $language);
EOF

my $sth = $dbh->prepare($sql);
$sth->execute();
db_disconnect($dbh);

If I strip out the non-ASCII characters, it works properly:

$fullText               =~ s/\xE4//g;                                                                                                                                                                              
$fullText               =~ s/\xFC//g;                                                                                                                                                                              
$fullText               =~ s/\xE5//g;                    

Should I be using iso-8859-1 encoding instead of utf8mb4?

2
  • Does it work if you add $fullText = decode( "iso-8859-1", $fullText ); Commented Oct 22, 2022 at 1:21
  • Yes, that worked, thanks. How much of the other decoding/quoting do I need to do? I changed the encoding type to utf8mb4 in my DBI->connect - do I need to know what the charset of the email that is being used? Commented Oct 23, 2022 at 2:09

1 Answer 1

2

$fullText is a string of text encoded using iso-8859-1. It needs to be decode into a string of text before working with it.

And DBD::mysql expects strings to be encoded appropriately for the connection. In your case, this is UTF-8.[1] So you need to encode the query into a string of bytes before passing it to DBD::mysql.

# Convert bytes to text
my $fullText = decode( "iso-8859-1", $fullText_latin1 );

# Convert text to SQL string literal
my $fullText_lit = $dbh->quote( $fullText );

my $sql = "... $fullText_lit ...";

# Convert text to bytes
my $sql_utf8 = encode( "UTF-8", $sql );

my $sth = $dbh->prepare( $sql_utf8 );

Note that DBD::mysql suffers from The Unicode Bug.

Perl has two internal storage formats for strings: "downgraded" and "upgraded". It shouldn't matter which internal storage format is used for a string. But DBD::mysql (and every other DBD I know) access the internal string buffer without checking which of the two storage format is used. Code that does this is said to suffer from The Unicode Bug.

So what I said earlier isn't quite right. DBD::mysql expects the internal buffer of the strings to be encoded appropriately for the connection. encode always produces a string in the downgraded format, and the internal buffer of such strings always contain the string exactly.

This means that as long as you encode the strings passed to DBD::mysql using encode and the proper encoding, you won't have problems.


  1. One of the effects of passing mysql_enable_utf8mb4 => 1 as a connection option is to set the encoding connection to UTF-8 as if you had used SET NAMES utf8mb4.
Sign up to request clarification or add additional context in comments.

4 Comments

I think there is something I'm still not understanding. Can I ask you to lay this out more clearly for me? This particular email is encoded using ISO-8859-1 to support the binary characters, but others are encoded using UTF-8. It appears you are saying to encode the text, but the example you gave above was to decode it. Do I need to distinguish between the two to decide whether to encode/decode? It also appears Joomla requires UTF-8, which is why I'm using mysql_enable_utf8mb4, correct?
Re "This particular email is encoded using ISO-8859-1 to support the binary characters,", This makes no sense. iso-8859-1 is a way to encode text. It says that a is 0x61, b is 0x62, etc
Re "It also appears Joomla requires UTF-8, which is why I'm using mysql_enable_utf8mb4", This does two things: 1) Sets the text encoding used for communication with the database to UTF-8. This is good, because it means you can exchange any Unicode character with the db. 2) It causes strings coming from the database to be automatically decoded. /// BUT. It does not cause strings passed to the database to be automatically encoded, thus the need to encode showed in the answer. /// It's a poor API because it dates back to before Perl's Unicode support.
Re "It appears you are saying to encode the text, but the example you gave above was to decode it.", You always want to work with decoded text. So you need to decode inputs that are encoded. I said you needed to decode the body of the email, and I did that. (The library that parsed the message should have done that for you, though.) So you need to encode outputs that won't be encoded for you. I said you needed to encode strings passed to the database, and I did that. (Similarly, DBD:;mysql should do that for you, but it doesn't for the historical reason mentioned above.)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.