
I'm quite familiar with Beautiful Soup in Python; I have always used it to scrape live sites.

Now I'm scraping a local HTML file (link, in case you want to test the code). The only problem is that accented characters are not displayed correctly (this never happened to me when scraping live sites).

This is a simplified version of the code

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)

which prints the following output

2:22 - Il Destino Ãˆ GiÃ  Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

while the correct output should be

2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]


I looked for a solution, read many questions/answers and found this answer, which I implemented in the following way

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs

response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")

However, it raises the following error

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\score.py", line 8, in <module>
    html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')

I guess the problem is easy to solve, but how do I do it?

2 Answers


Using open('AH.html') decodes the file with a platform-dependent default encoding that may not match the file's actual encoding. BeautifulSoup understands HTML meta headers; specifically, the following tag indicates the file is UTF-8-encoded:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Open the file in binary mode and let BeautifulSoup figure it out:

with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')
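If you want to confirm what encoding BeautifulSoup settled on, it exposes it as soup.original_encoding when the input is bytes; for this file it should report utf-8, matching the meta tag above:

# Encoding detected by BeautifulSoup from the bytes / meta tag; expected: utf-8
print(soup.original_encoding)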

Sometimes, websites set the encoding incorrectly. In that case you can specify the encoding yourself if you know what it should be.

with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
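Either way, re-running the lookup from the question on the resulting soup should now print the accented title correctly (assuming the same file and the same table/row layout):

# Same lookup as in the question; row index 55 is assumed to exist
print(soup.find_all("table")[0].find_all("tr")[55].text)
# expected: 2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]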

5 Comments

Thanks. Anyway, the last snippet you wrote, which solves the problem, was already posted by αԋɱҽԃ in the chat (see the comments to his answer). But he went even further and found the origin of the problem: UTF-8 was not enabled for programs that don't support Unicode (I'm on Windows 10). After enabling it, encoding="utf8" was no longer needed, i.e. plain open("AH.html") does the job.
@soundwave Answers in chat aren't answers to Stack Overflow questions. Changing a Windows setting isn't the best answer; using BeautifulSoup correctly is. Also, explicitly specifying the encoding instead of relying on defaults that vary by OS version is better practice.
I added to the question the 2 working solutions which are in the chat
@soundwave Answers on Stack Overflow should be in the answers section so they can be voted on. Since you've awarded an answer, edit that one so it is actually correct, or, if you're not comfortable with that, write your own answer and award it to yourself. But in my opinion, having to change Windows settings isn't the correct answer.
Oh sorry, I edited the answer adding the two solutions. Thanks.
0
from bs4 import BeautifulSoup


with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

(screenshot of the answerer's output)

I have to say, your first code is actually fine and should work.

Regarding the second code, you are trying to decode a str, which is the problem: the decode function expects a bytes object.
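For what it's worth, a minimal sketch of how that second snippet could be fixed (assuming the file really is UTF-8): open it in binary mode so read() returns bytes, which codecs.decode accepts:

import codecs
from bs4 import BeautifulSoup

# Binary mode: read() returns bytes, so codecs.decode() no longer raises TypeError
with open("AH.html", "rb") as response:
    content = response.read()

html = codecs.decode(content, "utf-8")
soup = BeautifulSoup(html, "html.parser")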

I believe you are using Windows, where the default encoding is cp1252, not UTF-8.
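That mismatch would explain exactly the garbling in your output; a quick sketch (my assumption: the file is UTF-8 and open() falls back to cp1252 on your machine):

# UTF-8 bytes of the accented title, wrongly decoded as cp1252,
# reproduce the mangled characters shown in the question
title = "2:22 - Il Destino È Già Scritto"
print(title.encode("utf-8").decode("cp1252"))
# -> 2:22 - Il Destino Ãˆ GiÃ  Scritto   (the Ã is followed by a no-break space)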

Could you please run the following code:

import sys

print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)

Then check whether your output says UTF-8 or cp1252.

Note that if you are using VS Code with Code Runner, kindly run your code in the terminal as py code.py

SOLUTIONS (from the chat)

(1) If you are on Windows 10

  • Open Control Panel and change view by Small icons
  • Click Region
  • Click the Administrative tab
  • Click on Change system locale...
  • Tick the box "Beta: Use Unicode UTF-8..."
  • Click OK and restart your PC (a quick way to verify the change afterwards is sketched below)
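After the restart, you can verify the change from Python. This is only a hedged check on my part: my assumption is that the setting switches the system ANSI code page to UTF-8, so the locale default used by open() should then report a UTF-8 codec (utf-8 or cp65001, depending on the Python version):

import locale, sys

# Default encoding used by open() when no encoding= argument is given;
# expected to become a UTF-8 codec once the setting is enabled (assumption)
print(locale.getpreferredencoding(False))
print(sys.getdefaultencoding())  # always utf-8 on Python 3, for comparison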

(2) If you are not on Windows 10 or just don't want to change the previous setting, then in the first code change open("AH.html") to open("AH.html", encoding="UTF-8"), that is write:

from bs4 import BeautifulSoup

with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

5 Comments

Thank you, however I get the same wrong output, as you can see here: i.imgur.com/aH69dHM.png. I guess the problem is my computer?
@soundwave well, check with with open("AH.html", encoding="utf-8") as f:
@soundwave I noticed that you did something wrong! Please, run your code outside the Python interpreter. Put the import sys code in a file and then run it as py code.py.
Excuse me, what do you mean? If I run print(sys.getdefaultencoding()) and the other commands outside Python I get errors. I saved the code you wrote at the beginning of the answer in a file called score2.py, but when I run either score2.py or py score2.py in cmd, I get the wrong output.
