
I'm having trouble with character encoding when writing to files from a script. I'm downloading some information from a website through an API. I have no control over the format I receive the information in, but here's a quick sample:

{'id': 12, 'name': "Kathy \xc3\x93 Fakename"}
{'id': 23, 'name': "Se\xc3\xb1or Murphy"}

(the names there are "Kathy Ó Fakename" and "Señor Murphy")

This is mostly fine: when I write these to a generic file with no file extension, I get them in the proper format with the correct characters.

However, I have two problems. I'm writing all this information into an HTML table. When I write to a file with .html as its extension, the wrong characters are written to the file: I end up with the names Kathy Ã“ Fakename and SeÃ±or Murphy. These incorrect characters also show up in the actual filename, even though the correct characters I want there are perfectly valid in filenames.

I believe I verified that the only difference is the filetype, though I am still confused since I didn't expect Python to implicitly adjust what I wrote. Also it definitely is in the source of the HTML, not just how it displays.

To demonstrate, this code:

import os

# `users` is the list of dicts shown in the sample above
with open(os.path.abspath("Test.html"), 'w') as f:
    for user in users:
        f.write("{}: {}<br>".format(user['id'], user['name']))
with open(os.path.abspath("Test"), 'w') as f:
    for user in users:
        f.write("{}: {}\n".format(user['id'], user['name']))

Results in

Test
12: Kathy Ó Fakename
23: Señor Murphy

Test.html
12: Kathy Ã“ Fakename<br>
23: SeÃ±or Murphy<br>

What's causing the difference here?

  • You are writing UTF-8, but if you open the file with a tool that expects Latin-1 or Windows codepage 1252, then yes, you'll see mojibake. Commented Jul 22, 2015 at 11:08
  • @PadraicCunningham: look at the tags. :-) Commented Jul 22, 2015 at 11:09

1 Answer


You are writing UTF-8 data, but whatever tool you are using to read the files is decoding them as Windows codepage 1252 (cp1252):

>>> print "Kathy \xc3\x93 Fakename".decode('utf8')
Kathy Ó Fakename
>>> print "Kathy \xc3\x93 Fakename".decode('cp1252')
Kathy Ã“ Fakename
>>> print "Se\xc3\xb1or Murphy".decode('utf8')
Señor Murphy
>>> print "Se\xc3\xb1or Murphy".decode('cp1252')
SeÃ±or Murphy
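
The same comparison can be sketched without relying on what the terminal prints. This is only an illustration; the variable names are made up for the example and the expected strings are written with escapes so the source stays ASCII. It also shows that the garbled text maps back to the original bytes, which suggests the file itself was never changed.

```python
# Raw bytes as they appear in the question's sample data.
raw = b"Se\xc3\xb1or Murphy"

# Decoding with the codec the data was written in gives the intended text.
as_utf8 = raw.decode("utf-8")

# Decoding the very same bytes as cp1252 turns each UTF-8 byte into its own
# character, which is the garbled form the browser displayed.
as_cp1252 = raw.decode("cp1252")

print(as_utf8 == u"Se\u00f1or Murphy")          # True
print(as_cp1252 == u"Se\u00c3\u00b1or Murphy")  # True

# Re-encoding the garbled text as cp1252 recovers the original bytes, so
# nothing in the file was altered; only the reader's codec was wrong.
print(as_cp1252.encode("cp1252") == raw)        # True
```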

Use the right tools or tell those tools to use UTF-8 instead. When using HTML, you could include a meta tag to tell tools what codec to use:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
     Kathy Ó Fakename<br />
     Señor Murphy<br />
  </body>
</html>
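
As a rough sketch of how the script could produce such a page, the following decodes the API's byte strings and writes the file with an explicit UTF-8 encoding, so the bytes on disk match the declared charset. The helper names `build_page` and `write_page` are hypothetical, and the sample data is assumed to arrive as UTF-8 bytes as in the question. It uses `io.open` so it should behave the same under Python 2.7 and Python 3.

```python
import io

# Sample records shaped like the question's data: names are UTF-8 bytes.
users = [
    {"id": 12, "name": b"Kathy \xc3\x93 Fakename"},
    {"id": 23, "name": b"Se\xc3\xb1or Murphy"},
]


def build_page(users):
    """Return the page as text, declaring UTF-8 in a meta tag."""
    rows = [
        u"    {}: {}<br />".format(user["id"], user["name"].decode("utf-8"))
        for user in users
    ]
    head = [
        u"<html>",
        u"  <head>",
        u'    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
        u"  </head>",
        u"  <body>",
    ]
    tail = [u"  </body>", u"</html>", u""]
    return u"\n".join(head + rows + tail)


def write_page(path, users):
    # Encode explicitly instead of relying on the platform default.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(build_page(users))
```

Calling `write_page("Test.html", users)` would then produce a file that a browser should decode as UTF-8 because of the meta tag.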

You may want to read up on Python and Unicode.


1 Comment

I feel foolish, it hadn't occurred to me I was still looking at the HTML source in the browser instead of Notepad++ (where I was looking at the plain file). So this was exactly my mistake, thank you!
