
When the server's content type is 'Content-Type: text/html', requests.get() returns improperly encoded data.

However, if the content type is given explicitly as 'Content-Type: text/html; charset=utf-8', it returns properly encoded data.

Also, when we use urllib.urlopen(), it returns properly encoded data.

Has anyone noticed this before? Why does requests.get() behave like this?

4 Answers


The "educated guesses" (mentioned above) are probably just a check of the Content-Type header as sent by the server (quite a misleading use of "educated", imho).

For the response header Content-Type: text/html the result is ISO-8859-1 (the default for HTML4), regardless of any content analysis (the default for HTML5, by contrast, is UTF-8).

For the response header Content-Type: text/html; charset=utf-8 the result is UTF-8.
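requests exposes this header-only rule as requests.utils.get_encoding_from_headers; here is a stdlib-only sketch of the same logic (my reimplementation for illustration, not requests' actual code):

```python
from email.message import Message

def encoding_from_headers(content_type):
    """Mimic requests' header-based guess: an explicit charset wins;
    otherwise any text/* type falls back to ISO-8859-1 (the HTML4 default)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None

print(encoding_from_headers("text/html"))                 # ISO-8859-1
print(encoding_from_headers("text/html; charset=utf-8"))  # utf-8
```

Note that no inspection of the body happens here at all; the fallback comes purely from the media type.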

Luckily for us, requests uses the chardet library, which usually works quite well (exposed as the requests.Response.apparent_encoding attribute), so you usually want to do:

r = requests.get("https://martin.slouf.name/")
# override encoding by real educated guess as provided by chardet
r.encoding = r.apparent_encoding
# access the data
r.text

4 Comments

The approach with r.encoding = r.apparent_encoding didn't work (é showed up as Ã©) for a web page where line 13 of 374 is <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />. However, changing to r.encoding = 'UTF-8' worked ok. One could have code to search r.text for a "Content-Type" ... charset=... entry, then set r.encoding before accessing r.text further. This would be clunky but more general than just setting the encoding to UTF-8.
Well, it is a guess after all ;). I suppose you realize that the r.apparent_encoding value is set by the chardet library -- and of course, it can guess wrong. You should also be aware that you should not access r.text before setting the r.encoding to desired value (using r.apparent_encoding or any method desirable). I recommend reading the chardet library docs (chardet.readthedocs.io/en/latest) if you are attempting to guess it your own way -- they may offer the solution you seek.
ok. Note, re "should not access r.text before setting the r.encoding to desired value", some doc I looked at (and now can't find) gave impression it is ok to repeatedly set different encodings and then access .text if you want to see different encodings. ¶ But a doc looked at just now implies that's not so. ¶ Re chardet, I see it has methods that would be less ad hoc than searching for a charset=... entry. Thanks!
This was a great solution for me. I was using requests and Beautiful Soup to do web scraping. At first I thought the issue was with Beautiful Soup and I was ready to dive into its documentation to figure what it does with respect to UTF-8. Before that though, I checked the string returned with .text on my response object. It had the badly-encoded characters. In my case, it looked like 19% ± 3%â\x96¼ for text that should actually be 19% ± 3%▼. encoding was "ISO-8859-1" and apparent_encoding was "UTF-8". By setting encoding to apparent_encoding, then getting text, it worked.
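The "clunky but more general" approach from the first comment -- scanning the page itself for a meta charset declaration -- could be sketched like this (a rough regex sketch of my own, not how browsers or chardet actually sniff encodings):

```python
import re

def charset_from_meta(html_bytes):
    """Search the first 2 KB of a page for charset=... inside a <meta> tag.
    ASCII-decode with replacement so undecodable bytes don't raise."""
    head = html_bytes[:2048].decode("ascii", errors="replace")
    m = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return m.group(1) if m else None

page = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
print(charset_from_meta(page))  # UTF-8
```

The pattern handles both the HTML4 http-equiv form and the HTML5 <meta charset="..."> form; one could then assign the result to r.encoding before touching r.text.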

From requests documentation:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'

Check the encoding requests used for your page, and if it's not the right one, force it to be the one you need.

Regarding the differences between requests and urllib.urlopen -- they probably just use different ways to guess the encoding. That's all.
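To show the check-then-force pattern without hitting the network, here is a sketch using a hand-built Response object (the headers, body text, and the use of the private _content attribute are all my own test scaffolding, not something a real request requires):

```python
import requests

# Simulate a server that declares 'Content-Type: text/html' (no charset)
# but actually sends a UTF-8 body.
resp = requests.models.Response()
resp.headers["Content-Type"] = "text/html"
resp._content = "naïve café".encode("utf-8")

resp.encoding = requests.utils.get_encoding_from_headers(resp.headers)
print(resp.encoding)  # ISO-8859-1 -- the header-based guess
print(resp.text)      # mojibake: decoded with the wrong codec

resp.encoding = "utf-8"  # force the one you need
print(resp.text)         # naïve café
```

Because .text is decoded lazily from .content on each access, changing r.encoding before reading .text is all it takes.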

2 Comments

Link not working. This is the new one: requests.readthedocs.io/en/latest/user/quickstart/…
Link is ok now.

After getting the response, take response.content instead of response.text; it holds the raw bytes as sent by the server, so a UTF-8 body survives untouched regardless of what encoding requests guessed.

response = requests.get(download_link, auth=(myUsername, myPassword), headers={'User-Agent': 'Mozilla'})
print(response.encoding)
if response.status_code == 200:
    body = response.content
else:
    print("Unable to get response with code: %d" % response.status_code)
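If you go the .content route, you still have to decode the bytes yourself. A small sketch of the try-UTF-8-then-fall-back pattern (the fallback encoding here is an assumption; in practice you might pass response.apparent_encoding instead):

```python
def decode_body(raw, fallback="iso-8859-1"):
    """Try UTF-8 first; fall back to a legacy/guessed encoding if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

print(decode_body("café".encode("utf-8")))       # café
print(decode_body("café".encode("iso-8859-1")))  # café (via the fallback)
```

Unlike blindly calling response.content.decode('utf-8'), this won't raise on a server that really did send Latin-1.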

2 Comments

This is a lifesaver for me too!
This answer isn't quite correct. The difference between both is that response.content is of type bytes, since it's the raw sequence of bytes returned as the server's response. If you are sure that the response is in UTF-8 then you can turn it into a string with response.content.decode('utf-8'), but this will fail if the response is not in fact valid UTF-8. In contrast, response.text is a str and, as commented above, is the result of automatically decoding response.content using whatever encoding Requests assumes to be the correct one based on the HTTP headers.

The default assumed content encoding for text/html is ISO-8859-1, aka Latin-1 :( (see RFC 2854). UTF-8 was too young to become the default; it was born in 1993, at about the same time as HTML and HTTP.

Use .content to access the byte stream, or .text to access the decoded Unicode stream. If the HTTP server does not care about the correct encoding, the value of .text may be off.
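The â\x96¼ garbling reported in an earlier comment is exactly this effect: the same bytes, decoded under the Latin-1 default versus the UTF-8 the server actually sent (stdlib-only demonstration):

```python
raw = "19% ± 3%▼".encode("utf-8")  # the bytes .content would hold
print(raw.decode("iso-8859-1"))    # mangled: each UTF-8 byte read as a Latin-1 char
print(raw.decode("utf-8"))         # 19% ± 3%▼
```

Every multi-byte UTF-8 character turns into two or three Latin-1 characters, which is why the damage is always of the "Ã©"/"â\x96¼" shape.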

3 Comments

In my case, this was the answer. The answer given by @bubak worked, but it has bad performance for all the transformations. content is the key
I was able to do something like this, to make sure if we could not convert to what I wanted, we at least got something. I also found that the content was much faster to process then setting the encoding. try: lContent = lResponse.content.decode('UTF-8') except: lContent = lResponse.content.decode(lResponse.apparent_encoding)
Using .content did the trick/worked for me +1
