I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:

>>> with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())

I get an error:

Traceback (most recent call last):
  File "<pyshell#11>", line 2, in <module>
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
    return p.parse(doc, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

What is wrong and what should I do?

1 Answer

Something broke in the latest versions of html5lib with respect to bs4: html5lib.treebuilders._base is no longer there. With bs4 4.4.1, the latest compatible version seems to be the one with seven nines. Once you install it as below, it works fine:

 pip3 install -U html5lib=="0.9999999"

Tested using bs4 4.4.1:

In [1]: import bs4

In [2]: bs4.__version__
Out[2]: '4.4.1'

In [3]: import html5lib

In [4]: html5lib.__version__
Out[4]: '0.9999999'

In [5]: from urllib.request import  urlopen

In [6]: with urlopen("http://example.com/") as f:
   ...:         document = html5lib.parse(f, encoding=f.info().get_content_charset())
   ...:     

In [7]: 

You can see the change in the commit "Rename treebuilders._base to .base to reflect public status", which is where the name was changed.

The error you see occurs because you are still using the newest version. There, in html5lib/_inputstream.py, HTMLBinaryInputStream has no encoding argument:

class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.

    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.

    """

    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):

Setting override_encoding=f.info().get_content_charset() should do the trick.
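
If the same script has to run against both html5lib releases, one option is to check which keyword the input stream class accepts before calling parse. This is only a sketch: encoding_kwargs, OldStream and NewStream are hypothetical names made up for this example, and the two stand-in classes only mimic the relevant part of the constructors (the new one follows the signature quoted above, the old one is assumed from the encoding= call in the question).

```python
import inspect


def encoding_kwargs(stream_cls, charset):
    """Build the keyword argument that stream_cls accepts for a known charset.

    Hypothetical helper: it inspects the constructor signature instead of
    comparing version strings.
    """
    params = inspect.signature(stream_cls.__init__).parameters
    # Newer html5lib takes override_encoding; older releases took encoding.
    for name in ("override_encoding", "encoding"):
        if name in params:
            return {name: charset}
    raise TypeError("no known encoding keyword on %r" % (stream_cls,))


# Stand-ins for the two constructor shapes (assumptions, not the real classes).
class OldStream:
    def __init__(self, source, encoding=None):
        self.source, self.encoding = source, encoding


class NewStream:
    def __init__(self, source, override_encoding=None, transport_encoding=None):
        self.source, self.encoding = source, override_encoding


print(encoding_kwargs(OldStream, "utf-8"))
print(encoding_kwargs(NewStream, "utf-8"))
```

With html5lib installed you would pass the real HTMLBinaryInputStream class instead of a stand-in and forward the result as html5lib.parse(f, **kwargs). Note that on the newest versions that class lives in a private module (html5lib._inputstream), so treat this as a workaround rather than a supported API.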

Alternatively, upgrading bs4 to its latest version works fine with the latest version of html5lib:

In [16]: bs4.__version__
Out[16]: '4.5.1'

In [17]: html5lib.__version__
Out[17]: '0.999999999'

In [18]: with urlopen("http://example.com/") as f:
   ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
   ....:     

In [19]: 