I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.
It all worked fine until I recently switched to Python 3.
I tested the same .html file on another machine with Python 2; it works there and returns the page contents.
soup = BeautifulSoup(open('page.html'), "lxml")
On the machine with Python 3 it doesn't work, and it says:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence
I searched around and tried the variants below, but none of them worked (using 'r' or 'rb' doesn't make much difference):
soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')
How can I use Python 3 to parse this html page?
Thank you.
If you use open('page.html', 'r'), then Python reads the document as plain text and tries to decode it with a locale-dependent default encoding, which is apparently GBK in your case. lxml should be fine with a binary stream, however, so you should try opening it with open('page.html', 'rb'). Or you can specify the correct encoding with the encoding= parameter. Note: depending on how the page was saved, the encoding declaration in the document may or may not be correct.
encoding, not from_encoding.
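To make the two suggestions concrete, here is a minimal sketch. The sample bytes, the helper names (write_sample, read_text, read_bytes, parse_saved_page) and the guess that the file is Windows-1252 are all assumptions for illustration. Byte 0x92 is a curly apostrophe in Windows-1252, but the real page may use a different encoding, so check before relying on it.

```python
import os
import tempfile

# Hypothetical sample page: 0x92 is a right single quote in Windows-1252,
# and 0x92 followed by a space is not a valid GBK sequence.
SAMPLE = b"<p>students\x92 work</p>"


def write_sample():
    """Write the sample bytes to a temporary .html file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "wb") as fh:
        fh.write(SAMPLE)
    return path


def read_text(path, encoding):
    """Read the file in text mode with an explicit encoding (open()'s encoding=)."""
    with open(path, "r", encoding=encoding) as fh:
        return fh.read()


def read_bytes(path):
    """Read the file in binary mode; no decoding happens at this point."""
    with open(path, "rb") as fh:
        return fh.read()


def parse_saved_page(path, parser="lxml"):
    """Hand the parser a binary stream so bs4 does the encoding detection.

    Needs the third-party bs4 package (and lxml for the default parser here).
    """
    from bs4 import BeautifulSoup

    with open(path, "rb") as fh:
        return BeautifulSoup(fh, parser)
```

With a file like this, read_text(path, "gbk") raises UnicodeDecodeError, which mirrors what a GBK locale default does, while read_text(path, "cp1252") and the binary read both succeed. If the detection in parse_saved_page guesses wrong, passing an explicit encoding to open() in text mode is the fallback.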