python-requests, finding the right encoding

Question

I have a problem reading a web page that doesn't specify a charset. It contains some non-ASCII characters, such as the euro sign, and my browser renders it fine. In Firefox, Page Info shows that the encoding used is 'ISO-8859-1' and the render mode is 'Quirks mode'. However, python-requests can't decode those non-ASCII characters properly, and I get an error when I try to write the resulting string to a text file. Example:

result = requests.get(url)
result.encoding = 'ISO-8859-1'
html = result.text
open('textfile.txt', 'w').write(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80'

If u'\x80' is supposed to represent the euro sign in 'ISO-8859-1', this should work:

print '\x80'.decode('ISO-8859-1')

but I get a non-printable character, not euro.

So how does the web page work in a browser when requests (and urllib/urllib2 too) can't handle that encoding? I also tried 'utf-8', with the same result. Any suggestions?

Community · Accepted Answer · 2017-05-23 12:03:21Z

3

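The problem is that the real encoding is cp1252 (Windows-1252). In ISO-8859-1, byte 0x80 is a non-printable control character; in cp1252 it is the euro sign, and browsers treat a page labelled ISO-8859-1 as Windows-1252. You can see the difference if you run this: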

print '\x80'.decode('cp1252')
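
To make the difference concrete, here is a small sketch that decodes the same byte with both codecs. It is written to run on Python 2.7 and Python 3 alike (an assumption on my part, since the question uses Python 2), and the variable names are only illustrative.

```python
# The single byte the question is stuck on.
raw = b'\x80'

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x80 becomes U+0080, a non-printable C1 control character.
as_latin1 = raw.decode('iso-8859-1')

# cp1252 (Windows-1252) assigns printable characters to most of the
# 0x80-0x9F range; 0x80 is the euro sign, U+20AC.
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1))
print(repr(as_cp1252))
```

Note that Python's cp1252 codec leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so decoding arbitrary data this way can raise UnicodeDecodeError.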

This related answer gives more detail:

PHP function iconv character encoding from iso-8859-1 to utf-8

It isn't about Python, but it's the same problem, and it gives some background on why this happens.
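
Applied to the code in the question, one possible approach is sketched below. It assumes you want to mimic the browser by substituting cp1252 whenever the charset is missing or reported as ISO-8859-1, and it writes the file with an explicit encoding so Python does not fall back to ASCII. The helpers `browser_like_encoding`, `decode_body`, `save_text` and `fetch_text` are hypothetical names for illustration, not part of requests; only `requests.get`, `Response.content` and `Response.encoding` are actual requests API.

```python
import io

# Labels that a browser would silently treat as Windows-1252.
LATIN1_ALIASES = {'iso-8859-1', 'iso8859-1', 'latin-1', 'latin1'}


def browser_like_encoding(declared):
    """Return 'cp1252' when the charset is missing or ISO-8859-1 (hypothetical helper)."""
    if declared is None or declared.lower() in LATIN1_ALIASES:
        return 'cp1252'
    return declared


def decode_body(raw_bytes, declared=None):
    """Decode the raw response bytes using the browser-like substitution."""
    return raw_bytes.decode(browser_like_encoding(declared))


def save_text(path, text):
    """Write unicode text as UTF-8 instead of relying on an implicit ASCII encode."""
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)


def fetch_text(url):
    """Fetch a page and decode it; requires the third-party requests package."""
    import requests
    result = requests.get(url)
    # result.content is the undecoded body; result.encoding is what requests guessed.
    return decode_body(result.content, result.encoding)
```

With these in place, `save_text('textfile.txt', fetch_text(url))` would replace the last three lines of the question's example. If you only need the quick fix, setting `result.encoding = 'cp1252'` before reading `result.text` should have the same effect on the decoding step.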

edited May 23, 2017 at 12:03

CommunityBot

answered Feb 28, 2013 at 23:37

pcalcao
