Here's what I have:

import requests
import lxml.html
import lxml.etree

r = requests.get("http://www.cnn.com/")
htmlelement = lxml.html.fromstring(r.text)
html = lxml.html.tostring(htmlelement)
tree = lxml.etree.fromstring(html)
print tree.xpath('//*[@id="cnn_maintt1imgbul"]/div/div[2]/div/h1/a')

I thought lxml.html corrected broken HTML?

The error is:

XMLSyntaxError: Opening and ending tag mismatch: link line 32 and head, line 75, column 8

Thanks!

1 Answer

I don't understand why you're trying to reparse the content after getting this far:

>>> htmlelement = lxml.html.fromstring(r.text)

Because at this point you can simply apply your xpath expression:

>>> results = htmlelement.xpath('//*[@id="cnn_maintt1imgbul"]/div/div[2]/div/h1/a')
>>> results
[<Element a at 0x1113a1f50>]
>>> print lxml.html.tostring(results[0])
'<a href="/2014/04/26/world/asia/south-korea-ship-sinking/index.html?hpt=hp_t1" target="">SOUTH KOREAN PRIME MINISTER RESIGNS</a>'

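I believe your problem is that lxml.html.tostring() still generates HTML by default, so void elements such as <link> are written without a closing tag. You then try to parse that output with the strict XML parser (lxml.etree.fromstring), which is why it complains about a mismatch between link and head.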
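
To make that concrete without hitting the network, here is a minimal sketch. The sample markup, the "headline" id and the variable names are made up for illustration and are not taken from the CNN page. It shows the XPath working directly on the HTML tree, the XML parser rejecting the default HTML serialization, and method="xml" as one possible workaround if you really do need an XML round trip:

```python
import lxml.etree
import lxml.html

# Hypothetical sample page: the <link> tag is never closed, which is fine in
# HTML but not well-formed XML.
SAMPLE_HTML = """<html>
<head>
<link rel="stylesheet" href="style.css">
<title>Sample</title>
</head>
<body>
<div id="headline"><h1><a href="/story.html">Top story</a></h1></div>
</body>
</html>"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Query the HTML tree directly; no second parse is needed.
links = doc.xpath('//*[@id="headline"]/h1/a')
link_text = links[0].text_content()
link_href = links[0].get("href")

# The default serialization is HTML, so <link> stays unclosed and the
# strict XML parser refuses it.
html_bytes = lxml.html.tostring(doc)
try:
    lxml.etree.fromstring(html_bytes)
    xml_parse_failed = False
except lxml.etree.XMLSyntaxError:
    xml_parse_failed = True

# Serializing with method="xml" closes every element, so the result can be
# parsed by the XML parser.
xml_bytes = lxml.html.tostring(doc, method="xml")
xml_tree = lxml.etree.fromstring(xml_bytes)
xml_links = xml_tree.xpath('//*[@id="headline"]/h1/a')
```

In most cases the first approach is all you need; the method="xml" step is only worth it if some later tool insists on well-formed XML.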
