4

I am writing a simple script to fetch the big grey table from here.

The code I have is the following:

import urllib2
from lxml import etree

html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read()

root = etree.XML(html)

But I am getting an error on the last statement.

Traceback (most recent call last):
  File "D:\Workspace\afi100\afi100.py", line 13, in <module>
    root = etree.XML(html)
  File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71683)
XMLSyntaxError: Space required after the Public Identifier, line 3, column 59

Any idea how I can get around this error?

Thanks.

3
  • You think it is a good idea to parse HTML using an XML parser? Commented Dec 6, 2010 at 21:15
  • You should use any available HTML-to-XML (XHTML) tool. Commented Dec 6, 2010 at 21:16
  • I was under the false impression that HTML was a subset of XML (it's not, but XHTML is). There's a good description of the major differences at techforum4u.com/content.php/… Commented Nov 21, 2013 at 22:38

2 Answers

10

You're trying to parse HTML with the XML parser. Use the lxml HTML parser instead:

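import urllib2
from lxml import etree

# Open the page; etree.parse() accepts a file-like object directly.
ufile = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx")

# HTMLParser tolerates markup that is not well-formed XML.
# Note that etree.parse() returns an ElementTree, not an Element.
root = etree.parse(ufile, etree.HTMLParser())

# Serialize the parsed tree to confirm it loaded.
print etree.tostring(root)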
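
For readers on Python 3, the difference between the two parsers can be sketched without any network access. This is only an illustration: `SAMPLE_HTML`, `xml_parse_fails`, and `parse_html` are made-up names, and the markup is a hypothetical stand-in for the real page, not its actual content.

```python
from lxml import etree

# Hypothetical stand-in for the fetched page: ordinary HTML that is not
# well-formed XML (unquoted attribute value, unclosed <br>).
SAMPLE_HTML = (
    b"<html><body><table border=1>"
    b"<tr><td>1</td><td>Citizen Kane<br></td></tr>"
    b"</table></body></html>"
)


def xml_parse_fails(data):
    """Return True if the strict XML parser rejects the input."""
    try:
        etree.XML(data)
    except etree.XMLSyntaxError:
        return True
    return False


def parse_html(data):
    """Parse the input with lxml's lenient HTML parser and return the root element."""
    return etree.fromstring(data, etree.HTMLParser())


if __name__ == "__main__":
    print(xml_parse_fails(SAMPLE_HTML))
    root = parse_html(SAMPLE_HTML)
    # Collect the text of every table cell with XPath.
    print(root.xpath("//td/text()"))
```

With a real page, the bytes returned by `urllib.request.urlopen(url).read()` would take the place of `SAMPLE_HTML`.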

1 Comment

Interesting, is that a true HTML parser or does it only set libxml2's recovery flag?
1

The document you link to is not well-formed XHTML, so you can't use an XML parser to load it.

You have to use an HTML parser like Beautiful Soup instead.
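
The well-formedness point can be checked with the Python 3 standard library alone. In this sketch the stdlib `html.parser` module is swapped in for Beautiful Soup purely to avoid a third-party dependency; it is not what this answer recommends. `SAMPLE`, `is_well_formed_xml`, `CellCollector`, and `collect_cells` are hypothetical names, and the markup is invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invented HTML fragment: fine for browsers, but not well-formed XML
# (unquoted attribute value, unclosed <br>).
SAMPLE = "<table border=1><tr><td>1</td><td>Citizen Kane<br></td></tr></table>"


def is_well_formed_xml(text):
    """Return True only if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def collect_cells(text):
    """Feed the text to the lenient HTML parser and return the cell contents."""
    collector = CellCollector()
    collector.feed(text)
    collector.close()
    return collector.cells


if __name__ == "__main__":
    print(is_well_formed_xml(SAMPLE))
    print(collect_cells(SAMPLE))
```

Running the well-formedness check on a page before choosing a parser is one way to find out whether an XML parser is an option at all.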

3 Comments

Thanks for the reply. Would libxml2dom work instead? I have used it before.
@nunos, probably not, since it's a binding to the libxml2 library which, to my knowledge, only reliably supports well-formed XML.
Whilst you can use Beautiful Soup, lxml can also handle HTML (see koblas' accepted answer).
