Parsing HTML: lxml error in Python

Question

I am writing a simple script to fetch the big grey table from here.

The code I have is the following:

import urllib2
from lxml import etree

html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read()

root = etree.XML(html)

But I am getting an error on the last statement.

Traceback (most recent call last):
  File "D:\Workspace\afi100\afi100.py", line 13, in <module>
    root = etree.XML(html)
  File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71683)
XMLSyntaxError: Space required after the Public Identifier, line 3, column 59

Any idea how I can get around this error?

Thanks.

You think it is a good idea to parse HTML using an XML parser?
– khachik, Commented Dec 6, 2010 at 21:15
I was under the false impression that HTML was a subset of XML (it's not, but XHTML is). There's a good description of the major differences at techforum4u.com/content.php/…
– naught101, Commented Nov 21, 2013 at 22:38

koblas · Accepted Answer · 2010-12-06 21:22:26Z

10

You're trying to parse HTML with the XML parser; use lxml's HTML parser instead.

import urllib2
from lxml import etree

# urlopen returns a file-like object that etree.parse can read directly.
ufile = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx")

# HTMLParser tolerates markup that is not well-formed XML.
root = etree.parse(ufile, etree.HTMLParser())

print etree.tostring(root)
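
To see the difference between the two parsers without hitting the network, here is a minimal sketch in Python 3 syntax (the snippet above is Python 2). The sample markup, the `films` class name, and the helper names `xml_parse_fails` and `parse_rows` are all made up for illustration; the real page's structure will differ, so the XPath expression would need adjusting.

```python
from lxml import etree

# Hypothetical stand-in for the fetched page. It is valid enough as HTML but
# not well-formed XML: the doctype has no system identifier, the class
# attribute is unquoted, and <br> is never closed.
SAMPLE_HTML = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><body>
<table class=films>
<tr><td>1</td><td>Film A</td></tr>
<tr><td>2</td><td>Film B</td></tr>
</table>
<br>
</body></html>"""


def xml_parse_fails(markup):
    """Return True if the strict XML parser rejects the markup."""
    try:
        etree.XML(markup)
    except etree.XMLSyntaxError:
        return True
    return False


def parse_rows(markup):
    """Parse with lxml's HTML parser and return the cell text of each table row."""
    root = etree.fromstring(markup, etree.HTMLParser())
    # The class name "films" is an assumption of this sample, not of the real page.
    rows = root.xpath('//table[@class="films"]//tr')
    return [[(td.text or "").strip() for td in row.findall("td")] for row in rows]


if __name__ == "__main__":
    print("XML parser fails:", xml_parse_fails(SAMPLE_HTML))
    for cells in parse_rows(SAMPLE_HTML):
        print(cells)
```

The same `parse_rows` approach should work on the bytes returned by a real HTTP request, once the XPath is matched to the actual table.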


1 Comment

Frédéric Hamidi Over a year ago

Interesting, is that a true HTML parser or does it only set libxml2's recovery flag?

Frédéric Hamidi · Answer · 2010-12-06 21:18:23Z

1

The document you link to is not well-formed XHTML, so you can't load it with an XML parser.

You have to use an HTML parser like Beautiful Soup instead.
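
For reference, a rough sketch of the Beautiful Soup route in Python 3, assuming the `bs4` package is installed and using the standard library's `html.parser` backend. The sample markup, the `films` class name, and the helper name `table_rows` are hypothetical; the real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page; not well-formed XML
# (unquoted attribute, unclosed <br>).
SAMPLE_HTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><body>
<table class=films>
<tr><td>1</td><td>Film A</td></tr>
<tr><td>2</td><td>Film B</td></tr>
</table>
<br>
</body></html>"""


def table_rows(markup):
    """Return the stripped cell text for every row of the sample table."""
    soup = BeautifulSoup(markup, "html.parser")
    # The class name "films" is an assumption of this sample.
    table = soup.find("table", class_="films")
    return [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in table.find_all("tr")
    ]


if __name__ == "__main__":
    for cells in table_rows(SAMPLE_HTML):
        print(cells)
```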


3 Comments

nunos Over a year ago

Thanks for the reply. Would libxml2dom work instead? I have used it before.

Frédéric Hamidi Over a year ago

@nunos, probably not, since it's a binding to the libxml2 library which, to my knowledge, only reliably supports well-formed XML.

Bala Clark Over a year ago

Whilst you can use Beautiful Soup, lxml can also handle HTML (see koblas' accepted answer).
