I have a problem reading a web page that doesn't specify a charset. It contains some non-ASCII characters, such as the euro sign, and my browser renders it fine. In Firefox, Page Info shows the encoding as 'ISO-8859-1' and the render mode as 'Quirks mode'. However, python-requests doesn't decode those non-ASCII characters correctly, and I get an error when I try to write the resulting string to a text file. Example:

result = requests.get(url)
result.encoding = 'ISO-8859-1'
html = result.text
open('textfile.txt', 'w').write(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80'

If u'\x80' is supposed to represent the euro sign in the 'ISO-8859-1' encoding, this should work:

print '\x80'.decode('ISO-8859-1')

but I get a non-printable character, not the euro sign.

So why does the page work in a browser, while requests (and urllib/urllib2 too) can't handle that encoding? I also tried 'utf-8', with the same result. Any suggestions?

1 Answer

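The problem is that the real encoding is cp1252 (Windows-1252), not ISO-8859-1. In ISO-8859-1, byte 0x80 is a non-printable control character; in cp1252 it is the euro sign. Browsers generally treat a page labelled ISO-8859-1 as Windows-1252, which is why Firefox renders it correctly. You can see the difference by running: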

 print '\x80'.decode('cp1252')

This related answer gives more detail:

PHP function iconv character encoding from iso-8859-1 to utf-8

It's not related to Python, but it's the same problem, and it gives some background on why this happens.
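
To make the fix concrete, here is a minimal sketch of the two decodings side by side, followed by a write with an explicit file encoding. The sample bytes, the temporary file location, and the `fetch_text` helper are made up for illustration; only `requests.get`, `Response.encoding`, and `Response.text` come from the question. The choice of UTF-8 for the output file is an assumption; any encoding that can represent the euro sign would do.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Made-up sample of what the server might send: 0x80 is the euro byte in cp1252.
raw = b'Price: 10 \x80'

as_latin1 = raw.decode('ISO-8859-1')  # 0x80 -> U+0080, a control character
as_cp1252 = raw.decode('cp1252')      # 0x80 -> U+20AC, the euro sign

# Write with an explicit encoding instead of relying on the default codec,
# which is 'ascii' on Python 2 and cannot encode non-ASCII characters.
path = os.path.join(tempfile.mkdtemp(), 'textfile.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(as_cp1252)


def fetch_text(url):
    """Hypothetical helper: fetch a page and decode its body as cp1252."""
    import requests  # third-party; imported here so the sketch runs without it

    result = requests.get(url)
    result.encoding = 'cp1252'  # override the encoding used for result.text
    return result.text
```

With the page from the question, `fetch_text(url)` should return text containing a real euro sign, which can then be written the same way as `as_cp1252` above.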
