Writing Unicode to HTML File Differs from Plain File

Question

I'm having trouble with character encoding when writing to files from a script. I'm downloading some information from a website through an API. I have no control over the format I receive the information in, but here's a quick sample:

{'id': 12, 'name': "Kathy \xc3\x93 Fakename"}
{'id': 23, 'name': "Se\xc3\xb1or Murphy"}

(the names there are "Kathy Ó Fakename" and "Señor Murphy")

This is mostly fine: when I write these to a generic file with no file extension, I get the correct characters.

However, I have two problems. I'm writing all this information into an HTML table. When I write to a file with an .html extension, the wrong characters are written to the file: I end up with the names Kathy Ã“ Fakename and SeÃ±or Murphy. These incorrect characters also show up in the actual filename, even though the correct characters I want there are perfectly valid in filenames.

I believe I verified that the only difference is the filetype, though I am still confused since I didn't expect Python to implicitly adjust what I wrote. Also it definitely is in the source of the HTML, not just how it displays.

To demonstrate, this code:

import os

# `users` is the list of dicts returned by the API, as in the sample above.
with open(os.path.abspath("Test.html"),'w') as f:
    for user in users:
        f.write("{}: {}<br>".format(user['id'], user['name']))
with open(os.path.abspath("Test"),'w') as f:
    for user in users:
        f.write("{}: {}\n".format(user['id'], user['name']))

Results in

Test
12: Kathy Ó Fakename
23: Señor Murphy

Test.html
12: Kathy Ã“ Fakename<br>
23: SeÃ±or Murphy<br>

What's causing the difference here?

You are writing UTF-8, but if you are opening the file with a tool that expects Latin 1 or Windows Codepage 1252 then yes, you'll see Mojibake.
– Martijn Pieters, Commented Jul 22, 2015 at 11:08

Community · Accepted Answer · 2017-05-23 12:14:09Z

4

You are writing UTF-8 data, but whatever tool you are using to read the files is decoding them as Windows CP 1252:

>>> print "Kathy \xc3\x93 Fakename".decode('utf8')
Kathy Ó Fakename
>>> print "Kathy \xc3\x93 Fakename".decode('cp1252')
Kathy Ã“ Fakename
>>> print "Se\xc3\xb1or Murphy".decode('utf8')
Señor Murphy
>>> print "Se\xc3\xb1or Murphy".decode('cp1252')
SeÃ±or Murphy
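
The same comparison can be sketched in Python 3, where the raw data has to be written as `bytes` literals. This is only an illustration: it assumes the API really hands back UTF-8 bytes as in the question, and the names `raw_names` and `decoded` are made up here.

```python
# Raw bytes as they appear in the question's sample data (assumed to be UTF-8).
raw_names = [b"Kathy \xc3\x93 Fakename", b"Se\xc3\xb1or Murphy"]

# For each name, pair the intended text with what a CP 1252 reader would show.
decoded = [
    (raw.decode("utf-8"), raw.decode("cp1252"))
    for raw in raw_names
]

# The bytes are identical in both cases; only the codec used to read them differs.
# Re-encoding the garbled text as CP 1252 recovers the original bytes.
recovered = [garbled.encode("cp1252") for _, garbled in decoded]
```

Each two-byte UTF-8 sequence turns into two separate CP 1252 characters, which is why a single accented letter shows up as a pair starting with Ã.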

Use the right tools or tell those tools to use UTF-8 instead. When using HTML, you could include a meta tag to tell tools what codec to use:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
     Kathy Ó Fakename<br />
     Señor Murphy<br />
  </body>
</html>
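
On the Python side, one way to keep the bytes on disk and the declared charset in agreement is to decode the API data once and write the file with an explicit encoding. The following is a minimal Python 3 sketch under the assumption that the API returns UTF-8 bytes; `render_page` and the temporary output path are hypothetical names chosen for this example, not part of the original script.

```python
import io
import os
import tempfile


def render_page(users):
    """Build a small HTML page that declares UTF-8 in its head."""
    rows = "".join(
        "{}: {}<br>\n".format(user["id"], user["name"]) for user in users
    )
    return (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n'
        "</head>\n<body>\n" + rows + "</body>\n</html>\n"
    )


# Decode the raw bytes once, at the boundary (assumed to be UTF-8).
users = [
    {"id": 12, "name": b"Kathy \xc3\x93 Fakename".decode("utf-8")},
    {"id": 23, "name": b"Se\xc3\xb1or Murphy".decode("utf-8")},
]

path = os.path.join(tempfile.mkdtemp(), "Test.html")
# Name the encoding explicitly rather than relying on the platform default.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(render_page(users))
```

A browser opening the resulting file should then decode it as UTF-8, since the meta tag and the bytes written by `io.open` agree.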

You may want to read up on Python and Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO

edited May 23, 2017 at 12:14

answered Jul 22, 2015 at 11:09

Martijn Pieters

1 Comment

SuperBiasedMan Over a year ago

I feel foolish, it hadn't occurred to me I was still looking at the HTML source in the browser instead of Notepad++ (where I was looking at the plain file). So this was exactly my mistake, thank you!
