
My input data is as follows (as generated by hexdump):

000000f0  69 61 6e e2 80 99 73 20  65 79 65 73 20 61 62 72  |ian...s eyes abr|

When I open this HTML file in Firefox, it displays these characters as:

ian’s eyes abr

According to the link https://superuser.com/questions/1237545/characters-in-email-displayed-like-e2-80-99, "E2 80 99 is the sequence of hex values that encode a right single quotation mark (’) in UTF-8".

This website concurs: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128

When I run this iconv command on the file containing these characters:

iconv -f UTF-8 -t ISO-8859-15 test_chapter.html > blah.html

I get the output:

iconv: illegal input sequence at position 243

and the content of "blah.html" is truncated exactly where the apostrophe would be.
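The reported position can be cross-checked against the hexdump. This is a minimal sketch, assuming iconv reports a zero-based byte offset (an assumption, though it matches the truncation point): the dump line starts at offset 0xf0, and the e2 byte is the fourth byte on that line.

```shell
# Offset of the hexdump line (0xf0) plus the index of the e2 byte on it.
# 69 61 6e occupy indices 0..2, so e2 sits at index 3.
position=$((0xf0 + 3))
printf 'e2 is at byte offset %d\n' "$position"
```

The result equals the position in the error message, so iconv stops at the first byte of the quotation mark and not at some earlier, unrelated byte.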

So, to summarise, the internet says that is a valid sequence of bytes for UTF-8, but iconv disagrees.

Can anyone please help me understand what is going on? Is this a bug in iconv?

As a side note, when I use this html file with kindlegen to generate an AZW file, the character is not displayed correctly. All the internet can tell me is that I need to convert the file to UTF-8, but as far as I can tell, it already is!

  • I can convert to UTF-16 and back, so maybe the problem is with converting to ISO-8859-15 rather than converting from UTF-8? Commented Jan 6 at 15:48

2 Answers

Your comment:

maybe the problem is with converting to ISO-8859-15 rather than converting from UTF-8

is on the right track. The problem is that there is no ’ in ISO-8859-15. The most similar character is ' (the ASCII apostrophe). See what man 1 iconv states on the Debian 12 system I'm using:

If the string //TRANSLIT is appended to to-encoding, characters being converted are transliterated when needed and possible. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similar looking characters. Characters that are outside of the target character set and cannot be transliterated are replaced with a question mark (?) in the output.

Use -t ISO-8859-15//TRANSLIT then.

As a proof of concept, this works for me (in pl_PL.UTF-8 locale):

printf '%s\n' 'ian’s eyes abr' | iconv -f UTF-8 -t ISO-8859-15//TRANSLIT

The output is ian's eyes abr (with a newline at the end). The output contains only ASCII characters, so its representation is identical in ISO-8859-15 and in UTF-8. For that reason I chose not to obfuscate the command by additionally piping to iconv -f ISO-8859-15 -t UTF-8.
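The point about identical representation can be checked directly. The sketch below is illustrative only: it starts from the already transliterated ASCII string, and the variable names are made up for this example. It compares the hex dump of the string as typed with the hex dump after conversion to ISO-8859-15.

```shell
# The transliterated output contains only ASCII characters.
ascii="ian's eyes abr"

# Hex dump of the string as it is (ASCII, which is also valid UTF-8).
utf8_hex=$(printf '%s\n' "$ascii" | od -An -tx1 | tr -d ' \n')

# Hex dump of the same string after conversion to ISO-8859-15.
latin9_hex=$(printf '%s\n' "$ascii" | iconv -f UTF-8 -t ISO-8859-15 | od -An -tx1 | tr -d ' \n')

if [ "$utf8_hex" = "$latin9_hex" ]; then
    echo "identical bytes: $utf8_hex"
fi
```

Byte 27 in the dump is the ASCII apostrophe that replaced the three-byte sequence e2 80 99.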

  • See also windows-1252, a single-byte charset that is a superset of ISO-8859-1 and, since Windows 98, includes all the extra characters defined in ISO-8859-15. Among others, it also has U+2019 RIGHT SINGLE QUOTATION MARK, encoded as byte 0x92, in the range usually reserved for the C1 control set. Commented Jan 6 at 19:36
  • Thanks. That TRANSLIT option is ideal. I found that some of my issues were due to the tool I was using ignoring the "meta charset" HTML tag and that converting it to a "meta http-equiv=Content-Type" HTML tag seemed to work initially, but I still had some bad characters. I think converting to Win-1252 with the TRANSLIT option, then back to UTF-8, may solve that. Commented Jan 8 at 7:38
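The windows-1252 route mentioned in these comments can be tried with a short sketch. It assumes the local iconv accepts the charset name CP1252 (glibc and GNU libiconv do). The input is built with octal escapes for the bytes e2 80 99, so the script does not depend on its own file encoding. The variable names are illustrative.

```shell
# Encode U+2019 alone in windows-1252 and show the resulting byte in hex.
quote_hex=$(printf '\342\200\231' | iconv -f UTF-8 -t CP1252 | od -An -tx1 | tr -d ' \n')
echo "U+2019 in CP1252: 0x$quote_hex"

# Round trip UTF-8 -> CP1252 -> UTF-8. No //TRANSLIT is needed for this
# character, because CP1252 can represent the quotation mark directly.
original=$(printf 'ian\342\200\231s eyes abr')
roundtrip=$(printf '%s\n' "$original" | iconv -f UTF-8 -t CP1252 | iconv -f CP1252 -t UTF-8)

if [ "$roundtrip" = "$original" ]; then
    echo "round trip preserved the text"
fi
```

Unlike the ISO-8859-15 conversion, this keeps the typographic quotation mark instead of approximating it with an apostrophe. //TRANSLIT would still matter for any characters in the file that CP1252 lacks.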

Using Raku (formerly known as Perl_6)

~$ printf '%s\n' 'ian’s eyes abr' | raku -pe "tr/’/'/"
ian's eyes abr

#OR

~$ printf '%s\n' 'ian’s eyes abr' | raku -pe 'tr/’/\c[APOSTROPHE]/'
ian's eyes abr

Raku is a programming language in the Perl family that supports Unicode out of the box. Above you can translate a problematic character, or even a set of characters. Below, if you need to decipher some Unicode, you can go character-by-character:

~$ printf 'ian’s eyes abr\n' | raku -ne '.say for .comb.map(*.uniname);'
LATIN SMALL LETTER I
LATIN SMALL LETTER A
LATIN SMALL LETTER N
RIGHT SINGLE QUOTATION MARK
LATIN SMALL LETTER S
SPACE
LATIN SMALL LETTER E
LATIN SMALL LETTER Y
LATIN SMALL LETTER E
LATIN SMALL LETTER S
SPACE
LATIN SMALL LETTER A
LATIN SMALL LETTER B
LATIN SMALL LETTER R

More info below.

https://docs.raku.org/language/unicode
https://docs.raku.org/language/operators#tr///_in-place_transliteration
https://raku.org
