51 changes: 49 additions & 2 deletions CHANGES.rst
@@ -6,9 +6,56 @@ Change Log

Released on XXX, 2013

* Implementation updated to follow the `HTML specification
<http://www.whatwg.org/specs/web-apps/current-work/>`_ as of 5th May
2013 (`SVN <http://svn.whatwg.org/webapps/>`_ revision r7867).

* Python 3.2+ supported in a single codebase using the ``six`` library.

* Removed support for Python 2.5 and older.

* Removed the deprecated Beautiful Soup 3 treebuilder.
``beautifulsoup4`` can use ``html5lib`` as a parser instead. Note that
since it doesn't support namespaces, foreign content like SVG and
MathML is parsed incorrectly.

* Removed ``simpletree`` from the package. The default tree builder is
now ``etree`` (using the ``xml.etree.ElementTree/cElementTree``
implementation).
now ``etree`` (using the ``xml.etree.cElementTree`` implementation if
available, and ``xml.etree.ElementTree`` otherwise).

* Removed the ``XHTMLSerializer`` as it never actually guaranteed its
output was well-formed XML, and hence was of little use.

* Optional heuristic character encoding detection now based on
``charade`` for Python 2.6 - 3.3 compatibility.

* Optional ``Genshi`` treewalker support fixed.

* Many bugfixes, including:

* #33: null in attribute value breaks XML AttValue;

* #4: nested, indirect descendant, <button> causes infinite loop;

* `Google Code 215
<http://code.google.com/p/html5lib/issues/detail?id=215>`_: Properly
detect seekable streams;

* `Google Code 206
<http://code.google.com/p/html5lib/issues/detail?id=206>`_: add
support for <video preload=...>, <audio preload=...>;

* `Google Code 205
<http://code.google.com/p/html5lib/issues/detail?id=205>`_: add
support for <video poster=...>;

* `Google Code 202
<http://code.google.com/p/html5lib/issues/detail?id=202>`_: Unicode
file breaks InputStream.

* Source code is now mostly PEP 8 compliant.

* Test harness has been improved and now depends on ``nose``.


0.95
128 changes: 78 additions & 50 deletions README.rst
@@ -1,63 +1,98 @@
html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
  :target: https://travis-ci.org/html5lib/html5lib-python

html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.


Requirements
------------
Usage
-----

Python 2.6 and above as well as Python 3.0 and above are
supported. Implementations known to work are CPython (as the reference
implementation) and PyPy. Jython is known *not* to work due to various
bugs in its implementation of the language. Others such as IronPython
may or may not work; if you wish to try, you are strongly encouraged
to run the testsuite and report back!
Simple usage follows this pattern:

The only required library dependency is ``six``, which is available
on PyPI.
.. code-block:: python

Optionally:
  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

- ``datrie`` can be used to improve parsing performance (though in
almost all cases the improvement is marginal);
or:

- ``lxml`` is supported as a tree format (for both building and
walking) under CPython (but *not* PyPy where it is known to cause
segfaults);
.. code-block:: python

- ``genshi`` has a treewalker (but not builder); and
  import html5lib
  document = html5lib.parse("<p>Hello World!")

- ``charade`` can be used as a fallback when character encoding cannot
be determined; ``chardet``, from which it was forked, can also be used
on Python 2.
By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
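
As a rough illustration of the fallback described above, the accelerated
module can be tried first and the pure-Python one used otherwise. This is a
generic sketch of the import pattern, not html5lib's own source. On
Python 3.9 and later ``xml.etree.cElementTree`` no longer exists, so the
``except`` branch is taken there:

```python
# Prefer the C-accelerated module where it exists (e.g. Python 2.x),
# and fall back to the pure-Python module otherwise.
try:
    import xml.etree.cElementTree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree

# Both modules expose the same API, so the rest of the code is unchanged.
element = ElementTree.fromstring("<p>Hello World!</p>")
print(element.tag, element.text)
```

Because the two modules share an interface, calling code does not need to
know which one was actually imported.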

Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

html5lib is packaged with distutils. To install it use::
html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
use:

  $ python setup.py install
.. code-block:: bash

  $ pip install html5lib

Usage
-----

Simple usage follows this pattern::
Optional Dependencies
---------------------

  import html5lib
  with open("mydocument.html", "r") as f:
      document = html5lib.parse(f)
The following third-party libraries may be used for additional
functionality:

or::
- ``datrie`` can be used to improve parsing performance (though in
almost all cases the improvement is marginal);

  import html5lib
  document = html5lib.parse("<p>Hello World!")
- ``lxml`` is supported as a tree format (for both building and
walking) under CPython (but *not* PyPy where it is known to cause
segfaults);

More documentation is available in the docstrings.
- ``genshi`` has a treewalker (but not builder); and

- ``charade`` can be used as a fallback when character encoding cannot
be determined; ``chardet``, from which it was forked, can also be used
on Python 2.
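
To see which of the optional libraries listed above are importable in the
current environment, something like the following can be used. It is a
minimal sketch that assumes Python 3.4 or later (for
``importlib.util.find_spec``) and that each library's top-level module name
matches its package name; the helper name is hypothetical and is not part
of html5lib:

```python
import importlib.util

# Optional libraries named in the list above (assumed top-level module names).
OPTIONAL_DEPENDENCIES = ("datrie", "lxml", "genshi", "charade", "chardet")


def available_optional_dependencies(names=OPTIONAL_DEPENDENCIES):
    """Map each module name to True if it can be found for import here."""
    # find_spec returns None when a top-level module cannot be located.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_optional_dependencies().items():
        print("{}: {}".format(name, "installed" if present else "missing"))
```

The check only reports whether a module can be located; it does not verify
that html5lib will actually pick it up or that the installed version is
compatible.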


Bugs
@@ -70,28 +105,21 @@ Please report any bugs on the `issue tracker
Tests
-----

These are contained in the html5lib-tests repository and included as a
submodule, thus for git checkouts they must be initialized (for
release tarballs this is unneeded)::
Unit tests require the ``nose`` library and can be run using the
``nosetests`` command in the root directory. All should pass.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

And then they can be run, with ``nose`` installed, using the
``nosetests`` command in the root directory. All should pass.
This step is unnecessary for release tarballs.

If you have all compatible Python implementations available on your
system, you can run tests on all of them by using tox::

  $ pip install tox
  $ tox
  ...
  _______________________ summary ______________________
  py26: commands succeeded
  py27: commands succeeded
  py32: commands succeeded
  py33: commands succeeded
  congratulations :)
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Contributing
@@ -121,5 +149,5 @@ Questions?

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may have more success (and get a far quicker response)
asking on IRC in #whatwg on irc.freenode.net.
though you may get a quicker response asking on IRC in #whatwg on
irc.freenode.net.