I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.

It all worked fine until I recently switched to Python 3.

I tested the same .html file on another machine running Python 2, and it worked and returned the page contents.

soup = BeautifulSoup(open('page.html'), "lxml")

The machine with Python 3 doesn't work, and it says:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence

I searched around and tried the following, but none of them worked (using 'r' or 'rb' makes no big difference):

soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')

How can I use Python 3 to parse this html page?

Thank you.

  • Sounds like the HTML is probably declaring the wrong encoding. I don't know how you'd override that, though. Commented Oct 9, 2019 at 8:38
  • When you say open('page.html', 'r'), then Python reads the document as plain-text and tries to decode it with some locale-dependent default, which is apparently GBK in your case. lxml should be fine with a binary stream however, so you should try opening it with open('page.html', 'rb'). Or you specify the correct encoding with the encoding= parameter. Note: depending on how the page was saved, the encoding declaration in the document may or may not be correct. Commented Oct 9, 2019 at 8:48
  • @lenz, it says "TypeError: 'from_encoding' is an invalid keyword argument for open()" Commented Oct 9, 2019 at 8:57
  • The parameter is called encoding, not from_encoding. Commented Oct 9, 2019 at 8:59
  • @lenz, it says "ValueError: binary mode doesn't take an encoding argument". Commented Oct 9, 2019 at 9:06

2 Answers

It worked all fine until lately it's changed to Python 3.

In Python 3, strings are Unicode text by default, so when you open a file in text mode, Python tries to decode its bytes (with a locale-dependent encoding unless you specify one). Python 2, on the other hand, uses bytestrings and just returns the content of the file as-is. Try opening page.html in binary mode (open('page.html', 'rb')) and see if that works for you.
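The difference between the two modes can be reproduced with the standard library alone. This is a minimal sketch, not the asker's actual page: the sample bytes and the temporary file are made up for illustration, and 'gbk' is passed explicitly so the behaviour does not depend on the machine's locale.

```python
import os
import tempfile

# Hypothetical sample: 0x92 is a curly apostrophe in Windows-1252, and
# followed by a space it is not a valid GBK byte sequence.
raw = b"<html><body><p>dealer\x92 price</p></body></html>"

# Write the bytes to a temporary .html file (stand-in for page.html).
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Text mode: Python decodes while reading. GBK is named explicitly here
# to mimic a machine whose locale default is GBK.
try:
    with open(path, "r", encoding="gbk") as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: no decoding happens, so the bytes come back untouched.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)

print(type(text_mode_error).__name__)
print(data == raw)
```

A file object opened with 'rb' (or the bytes read from it) can then be handed to BeautifulSoup in place of the text-mode file object used in the question, which leaves the decoding to the parser.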


5 Comments

Thanks for the reply. It gives one more warning, which says: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
That's a warning from BeautifulSoup, see here for how to get rid of it: stackoverflow.com/questions/33511544/…
It's an additional warning message. The problem is still there.
@MarkK are you saying you opened the document in binary mode (open(..., 'rb')), and you still get a UnicodeDecodeError?
@GPhilo, it seems the problem wasn't in the BeautifulSoup part. I posted some changes, which helped solve the problem.

I made two changes and am not sure which one (or both) took effect.

The computer was formatted and reinstalled so some settings are different.

1. In the language settings,

Administrative language settings > Change system locale > 

Tick the box

Beta: Use Unicode UTF-8 for worldwide language support

2. In the code. For example, this is the original line:

print (soup.find_all('span', attrs={'class': 'listing-row__price'})[0].text.strip().encode("utf-8"))

When the part ".encode("utf-8")" was removed, it worked.

  • Update on 16th Oct. 2019: The above change works, but when the box "Beta: Use Unicode UTF-8 for worldwide language support" is ticked, fonts and text in foreign-language software don't display properly.

When the box is unticked, fonts and text in foreign-language software display well, but the problem in the question remains.

Solution with the box unticked (both the foreign-language software and the Python code work):

soup = BeautifulSoup(open(pages, 'r', encoding = 'utf-8', errors='ignore'), "lxml")
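What errors='ignore' does to the offending byte can be seen on a small sample. This is a sketch on made-up bytes, not the real page. The last line assumes the byte 0x92 came from a Windows-1252 document, where it is a right single quotation mark; nothing in the thread confirms the page's real encoding.

```python
# Hypothetical sample bytes; 0x92 on its own is not valid UTF-8.
raw = b"dealer\x92s price"

# errors="ignore" silently drops bytes that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" keeps a visible U+FFFD marker where data was lost.
replaced = raw.decode("utf-8", errors="replace")

# Assumption: if the file were really Windows-1252, decoding with
# "cp1252" would keep the character instead of discarding it.
guessed = raw.decode("cp1252")

print(ignored)
print(replaced)
print(guessed)
```

The same errors= values are accepted by open() in text mode, so 'ignore' in the line above trades the exception for silently missing characters. If those characters matter, 'replace' at least shows where they were.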

2 Comments

The second is the one that "solves" your problem, by simply printing the raw byte string instead of trying to encode it as UTF-8. You still have invalid unicode characters in your text, but if that's not important for your usage, ignoring them is a good option ;)
@GPhilo, however, it seems not: when the box "Beta: Use Unicode UTF-8 for worldwide language support" is unticked, the problem pops up again (even with the ".encode("utf-8")" part removed, it doesn't work).
