I'm quite familiar with Beautiful Soup in Python, I have always used to scrape live site.
Now I'm scraping a local HTML file (link, in case you want to test the code), the only problem is that accented characters are not represented in the correct way (this never happened to me when scraping live sites).
This is a simplified version of the code
import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)
which prints the following output
2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
while the correct output should be
2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
I looked for a solution, read many questions/answers and found this answer, which I implemented in the following way
import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs
response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")
However, it runs the following error
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\user\Desktop\score.py", line 8, in <module>
html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')
I guess is easy to solve the problem, but how to do it?
