iconv fails to detect valid utf-8 character as utf-8

Question

My input data is as follows (as generated by hexdump):

000000f0  69 61 6e e2 80 99 73 20  65 79 65 73 20 61 62 72  |ian...s eyes abr|

When I open this HTML file in Firefox, it displays these characters as:

ian’s eyes abr

According to the link https://superuser.com/questions/1237545/characters-in-email-displayed-like-e2-80-99, "E2 80 99 is the sequence of hex values that encode a right single quotation mark (’) in UTF-8".

This website concurs: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128
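
The claim can also be checked by hand. The sketch below is not part of the original post: it assumes a POSIX shell and applies the standard 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) to the bytes E2 80 99 from the hexdump. The variable names are purely illustrative.

```shell
#!/bin/sh
# The three bytes from the hexdump (illustrative variable names).
b1=0xE2
b2=0x80
b3=0x99

# A 3-byte UTF-8 sequence carries 4 + 6 + 6 payload bits:
#   1110xxxx 10xxxxxx 10xxxxxx
# Mask off the marker bits and concatenate the payload.
codepoint=$(( (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F) ))

printf 'U+%04X\n' "$codepoint"   # prints U+2019
```

U+2019 is RIGHT SINGLE QUOTATION MARK, which matches what both links say.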

When I run this iconv command on the file containing these characters:

iconv -f UTF-8 -t ISO-8859-15 test_chapter.html > blah.html

I get the output:

iconv: illegal input sequence at position 243

and the content of "blah.html" is truncated exactly where the apostrophe would be.

So, to summarise, the internet says that is a valid sequence of bytes for UTF-8, but iconv disagrees.

Can anyone please help me understand what is going on? Is this a bug in iconv?

As a side note, when I use this html file with kindlegen to generate an AZW file, the character is not displayed correctly. All the internet can tell me is that I need to convert the file to UTF-8, but as far as I can tell, it already is!

I can convert to UTF-16 and back, so maybe the problem is with converting to ISO-8859-15 rather than converting from UTF-8?
– AlastairG, Commented Jan 6 at 15:48
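
One way to tell the two possibilities apart is to test the input side and the output side separately. This is a minimal sketch, assuming a glibc-style iconv such as the one on Debian (other implementations may behave differently). The sample string is typed in with octal escapes instead of being read from test_chapter.html, and the variable names are made up for the example.

```shell
#!/bin/sh
# Sample text containing U+2019; \342\200\231 is E2 80 99 in octal.
sample=$(printf 'ian\342\200\231s eyes abr')

# Input side: a UTF-8 -> UTF-8 pass can only succeed if the bytes
# are valid UTF-8 in the first place.
if printf '%s' "$sample" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    input_ok=yes
else
    input_ok=no
fi

# Output side: the same bytes, converted to ISO-8859-15.
if printf '%s' "$sample" | iconv -f UTF-8 -t ISO-8859-15 >/dev/null 2>&1; then
    target_ok=yes
else
    target_ok=no
fi

echo "valid UTF-8: $input_ok; fits ISO-8859-15: $target_ok"
```

If the first check passes and the second fails, the input is fine and the failure comes from the target character set.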

Kamil Maciorowski · Accepted Answer · 2025-01-06 17:21:39Z

Your comment:

maybe the problem is with converting to ISO-8859-15 rather than converting from UTF-8

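is on the right track. The input is valid UTF-8; the problem is that ISO-8859-15 has no ’ (U+2019), so iconv cannot write it to the output, even though it reports this as an illegal input sequence. The most similar character available is the ASCII apostrophe '. See what man 1 iconv states on the Debian 12 system I'm using: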

If the string //TRANSLIT is appended to to-encoding, characters being converted are transliterated when needed and possible. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similar looking characters. Characters that are outside of the target character set and cannot be transliterated are replaced with a question mark (?) in the output.

Use -t ISO-8859-15//TRANSLIT then.

As a proof of concept, this works for me (in pl_PL.UTF-8 locale):

printf '%s\n' 'ian’s eyes abr' | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT

The output is ian's eyes abr (with a newline at the end). The converted string is pure ASCII, so its byte representation is identical in ISO-8859-15 and in UTF-8; that is why I chose not to obfuscate the command by additionally piping to iconv -f ISO-8859-15 -t UTF-8.

answered Jan 6 at 16:06, edited Jan 6 at 17:21 – Kamil Maciorowski

See also the windows-1252 single-byte charset, which is a superset of iso8859-1 and, since Windows 98, includes all the extra characters defined in iso8859-15. Among others, it also has U+2019 RIGHT SINGLE QUOTATION MARK, placed in the range usually reserved for the C1 control set (U+2019 is encoded as byte 0x92 there).
– Stéphane Chazelas, Commented Jan 6 at 19:36

Thanks. That TRANSLIT option is ideal. I found that some of my issues were caused by the tool I was using ignoring the "meta charset" HTML tag; converting it to a "meta http-equiv=Content-Type" HTML tag seemed to work initially, but I still had some bad characters. I think converting to Win-1252 with the TRANSLIT option, then back to UTF-8, may solve that.
– AlastairG, Commented Jan 8 at 7:38
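
The two comments above can be tried out with a short sketch. It assumes an iconv that knows the charset name WINDOWS-1252 (glibc does; CP1252 is a common alias) and supports //TRANSLIT. The variable names are illustrative only, and the sample string stands in for the real file.

```shell
#!/bin/sh
quote=$(printf '\342\200\231')   # U+2019 as UTF-8 (E2 80 99)

# Encode the quote in windows-1252 and dump the resulting byte as hex.
cp1252_hex=$(printf '%s' "$quote" \
    | iconv -f UTF-8 -t WINDOWS-1252 \
    | od -An -tx1 | tr -d ' \n')
echo "windows-1252 byte: $cp1252_hex"

# Round trip: UTF-8 -> windows-1252 (with //TRANSLIT) -> UTF-8.
original=$(printf 'ian%ss eyes abr' "$quote")
roundtrip=$(printf '%s' "$original" \
    | iconv -f UTF-8 -t WINDOWS-1252//TRANSLIT \
    | iconv -f WINDOWS-1252 -t UTF-8)

if [ "$roundtrip" = "$original" ]; then
    echo "round trip preserved the curly quote"
fi
```

Because windows-1252 has a slot for U+2019, the quote survives the round trip unchanged. Only characters that windows-1252 lacks would be transliterated or replaced.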


jubilatious1 · Accepted Answer · 2025-01-11 00:48:21Z

Using Raku (formerly known as Perl_6)

~$ printf '%s\n' 'ian’s eyes abr' | raku -pe "tr/’/'/"
ian's eyes abr

#OR

~$ printf '%s\n' 'ian’s eyes abr' | raku -pe 'tr/’/\c[APOSTROPHE]/'
ian's eyes abr

Raku is a programming language in the Perl family that supports Unicode out of the box. The commands above translate a single problematic character; tr can also translate a whole set of characters. If you need to decipher some Unicode, you can go character by character, as below:

~$ printf 'ian’s eyes abr\n' | raku -ne '.say for .comb.map(*.uniname);'
LATIN SMALL LETTER I
LATIN SMALL LETTER A
LATIN SMALL LETTER N
RIGHT SINGLE QUOTATION MARK
LATIN SMALL LETTER S
SPACE
LATIN SMALL LETTER E
LATIN SMALL LETTER Y
LATIN SMALL LETTER E
LATIN SMALL LETTER S
SPACE
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER R
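
For comparison, the single-character replacement from the tr examples above can also be sketched with standard POSIX tools when Raku is not installed. This is an alternative technique using sed, not something from the answer itself; the variable names are illustrative.

```shell
#!/bin/sh
quote=$(printf '\342\200\231')   # U+2019 as UTF-8 bytes E2 80 99

# Replace every occurrence of the curly quote with an ASCII apostrophe.
fixed=$(printf 'ian%ss eyes abr\n' "$quote" | sed "s/$quote/'/g")
echo "$fixed"
```

Unlike //TRANSLIT, this only touches the one character named in the pattern, so any other non-ASCII characters in the file pass through untouched.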

More info below.

https://docs.raku.org/language/unicode
https://docs.raku.org/language/operators#tr///_in-place_transliteration
https://raku.org
