
Using Python 3 and BeautifulSoup 4, I want to get information from a website after making a request. I did this:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm').text

soup = BeautifulSoup(req,'lxml')

soup.find("h1").text
'\r\n                        CÃ\x82MARA MUNICIPAL DE SÃ\x83O PAULO'

I do not know what the encoding is, but it is a site in Brazilian Portuguese, so it should be UTF-8 or Latin-1.

Please, is there a way to find out which encoding is correct?

And then, how do I get BeautifulSoup to read this encoding correctly?


2 Answers


Requests determines the encoding like this (quoting its documentation):

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding.

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
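
In code, both guesses are visible on the response object (a quick sketch; apparent_encoding is Requests' chardet-based guess from the raw body bytes):

>>> req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')
>>> req.encoding            # guessed from the HTTP headers
'ISO-8859-1'
>>> req.apparent_encoding   # chardet's guess from the body
'utf-8'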

Inspecting the response headers shows that indeed "no explicit charset is present in the HTTP headers and the Content-Type header contains text":

>>> req.headers['content-type']
'text/html'

So requests faithfully follows the standard and decodes as ISO-8859-1 (latin-1).
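
You can reproduce the exact mojibake from the question by encoding with the right codec and decoding with the wrong one:

>>> 'CÂMARA'.encode('utf-8').decode('latin-1')
'CÃ\x82MARA'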

In the response content, a charset is specified:

<META http-equiv="Content-Type" content="text/html; charset=utf-16">

However, this is wrong: decoding as UTF-16 produces mojibake.

chardet correctly identifies the encoding as UTF-8.
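
You can verify that directly (the exact confidence value will vary):

>>> import chardet
>>> chardet.detect(req.content)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}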

So to summarise:

  • there is no general way to determine text encoding with complete accuracy
  • in this particular case, the correct encoding is UTF-8.

Working code (note that req here is the Response object, not its .text as in the question):

>>> import bs4
>>> req.encoding = 'UTF-8'
>>> soup = bs4.BeautifulSoup(req.text, 'lxml')
>>> soup.find('h1').text
'\r\n                        CÂMARA MUNICIPAL DE SÃO PAULO'
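
Alternatively, you can skip Response.text entirely and hand BeautifulSoup the raw bytes together with the encoding (a sketch using the from_encoding parameter):

>>> soup = bs4.BeautifulSoup(req.content, 'lxml', from_encoding='utf-8')
>>> soup.find('h1').text
'\r\n                        CÂMARA MUNICIPAL DE SÃO PAULO'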


When you use requests, you can read the response's encoding attribute and decode the raw content yourself, for example:

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

encoding = req.encoding   # the encoding requests guessed from the HTTP headers
text = req.content        # the raw response bytes

decoded_text = text.decode(encoding)
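
Note that for this particular page req.encoding is the headers-based guess (ISO-8859-1, as explained in the other answer), so decoding with it reproduces the mojibake. One option, assuming chardet guesses correctly, is to decode with req.apparent_encoding instead:

decoded_text = req.content.decode(req.apparent_encoding)  # chardet's guess; 'utf-8' for this page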

Comments

Thank you @simonjansson. Now I am searching for how to correctly read the ISO-8859-1 encoding.
Thank you very much @simonjansson, but I still get errors. After the commands above, I did: soup = BeautifulSoup(decoded_text, "lxml")
The result is: soup.find("h1") -> <h1> CÃMARA MUNICIPAL DE SÃO PAULO<br/></h1>
I'm using Ubuntu, could that be the cause?
One moment, I just noticed that the text appears wrong on my screen, but when I copied it, it was correct...
