I’m working on improving the character encoding support for a Python IRC bot that retrieves the titles of pages whose URLs are mentioned in a channel.
The current process I’m using is as follows:
```python
r = requests.get(url, headers={'User-Agent': '...'})
soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)
title = soup.title.string.replace('\n', ' ').replace(...)  # etc.
```
Specifying `from_encoding=r.encoding` is a good start, because it allows us to heed the charset from the Content-Type header when parsing the page.
Where this falls on its face is with pages that specify a `<meta http-equiv="Content-Type" content="…; charset=…">` or `<meta charset="…">` instead of (or in addition to) a charset in their Content-Type header.
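For example (a contrived page and charset; if I remember correctly, requests falls back to ISO-8859-1 for text/* responses whose headers don't name a charset), this is roughly what goes wrong:

```python
# Contrived illustration: the server sends "Content-Type: text/html" with no
# charset, but the body declares windows-1251 in a <meta> tag.
body = '<meta charset="windows-1251"><title>Заголовок</title>'.encode('windows-1251')

# With no charset in the header, requests decodes r.text as ISO-8859-1
# (its RFC 2616 fallback for text/*), so the title comes out as mojibake.
print(body.decode('iso-8859-1'))    # mojibake
print(body.decode('windows-1251'))  # correct: ...<title>Заголовок</title>
```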
The approaches I currently see from here are as follows:
- Use Unicode, Dammit unconditionally when parsing the page. This is the default, but it hasn't seemed to help with any of the pages I've been testing it on.
- Use ftfy unconditionally before or after parsing the page. I’m not fond of this option, because it basically relies on guesswork for a task for which we (usually) have perfect information.
- Write code to look for an appropriate `<meta>` tag, try to heed any encoding we find there, and then fall back on Requests' `.encoding`, possibly in combination with the previous option (a rough sketch of what I mean follows this list). I find this option ideal, but I'd rather not write this code if it already exists.
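For reference, here's roughly what I have in mind for that last option. Everything below is just an untested sketch: `decode_page` is a name I made up, the 4096-byte probe and the charset regex are guesses at "good enough", and the final fallbacks are arbitrary.

```python
import re

import bs4

# Matches the charset=... part of a Content-Type value such as
# "text/html; charset=windows-1251".
META_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.-]+)', re.I)


def decode_page(r):
    """Decode r.content, preferring a charset declared in a <meta> tag
    and falling back on the one requests took from the response headers."""
    # Probe only the first few KB of raw bytes; a <meta> charset
    # declaration is supposed to appear early in <head> anyway.
    head = bs4.BeautifulSoup(r.content[:4096], 'html.parser')
    declared = None
    for meta in head.find_all('meta'):
        if meta.get('charset'):                  # <meta charset="...">
            declared = meta['charset']
            break
        if meta.get('http-equiv', '').lower() == 'content-type':
            match = META_CHARSET_RE.search(meta.get('content', ''))
            if match:                            # <meta http-equiv="Content-Type" ...>
                declared = match.group(1)
                break
    # Try the declared encoding first, then the header-derived one,
    # then UTF-8; skip anything unknown or that fails to decode.
    for encoding in (declared, r.encoding, 'utf-8'):
        if not encoding:
            continue
        try:
            return r.content.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue
    return r.text  # last resort: whatever requests guessed
```

The bot would then build the soup from the decoded string, e.g. `soup = bs4.BeautifulSoup(decode_page(r), 'html.parser')`, and pull the title from there as before.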
TL;DR: is there a Right Way™ to make Beautiful Soup correctly heed the character encoding of arbitrary HTML pages on the web, using a technique similar to the one browsers use?