How to get text of broken HTML with lxml

Question

Here's what I have:

import requests
import lxml.etree
import lxml.html

r = requests.get("http://www.cnn.com/")
htmlelement = lxml.html.fromstring(r.text)
html = lxml.html.tostring(htmlelement)
tree = lxml.etree.fromstring(html)  # raises the XMLSyntaxError shown below
print tree.xpath('//*[@id="cnn_maintt1imgbul"]/div/div[2]/div/h1/a')

I thought lxml.html corrected the broken HTML?

The error is:

XMLSyntaxError: Opening and ending tag mismatch: link line 32 and head, line 75, column 8

Thanks!

stackoverflow.com/questions/1922032/…

– dstromberg, Commented Apr 27, 2014 at 2:39

larsks · Accepted Answer · 2014-04-27 02:38:22Z


I don't understand why you're trying to reparse the content after getting this far:

>>> htmlelement = lxml.html.fromstring(r.text)

Because at this point you can simply apply your xpath expression:

>>> results = htmlelement.xpath('//*[@id="cnn_maintt1imgbul"]/div/div[2]/div/h1/a')
>>> results
[<Element a at 0x1113a1f50>]
>>> print lxml.html.tostring(results[0])
'<a href="/2014/04/26/world/asia/south-korea-ship-sinking/index.html?hpt=hp_t1" target="">SOUTH KOREAN PRIME MINISTER RESIGNS</a>'

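I believe your problem is that lxml.html.tostring() still serializes the tree as HTML, not XML, so void elements such as <link> are written without a closing tag. You then hand that output to lxml.etree.fromstring(), which is a strict XML parser and fails on the unclosed <link> inside <head>.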
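
To illustrate both points without hitting the network, here is a minimal sketch that uses a small made-up page in place of the CNN markup. The `BROKEN_HTML` string, the `headline` id, and the link target are assumptions invented for this example, not taken from the question. It queries the tree returned by `lxml.html.fromstring()` directly, then shows that re-parsing the HTML serialization with the XML parser is what fails.

```python
import lxml.etree
import lxml.html

# Hypothetical stand-in for the real page: an unclosed <link> and an
# unclosed <p>, both of which are fine in HTML but not in XML.
BROKEN_HTML = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body><div id="headline"><h1><a href="/story.html">Example headline</a></h1>
<p>Unclosed paragraph
</div></body></html>
"""

# The HTML parser tolerates the broken markup and returns a usable tree.
doc = lxml.html.fromstring(BROKEN_HTML)

# Apply the XPath expression to that tree directly; no second parse needed.
links = doc.xpath('//*[@id="headline"]/h1/a')
link_text = links[0].text_content()  # text inside the <a> element
link_href = links[0].get("href")     # value of its href attribute
print(link_text, link_href)

# Serializing with lxml.html.tostring() yields HTML, so <link> stays
# unclosed, and the strict XML parser rejects it.
serialized = lxml.html.tostring(doc)
try:
    lxml.etree.fromstring(serialized)
    xml_reparse_failed = False
except lxml.etree.XMLSyntaxError as exc:
    xml_reparse_failed = True
    print("XML parser rejected the HTML serialization:", exc)
```

If only the link text is wanted, `text_content()` on the matched element is enough; the serialize-and-reparse step can be dropped entirely.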

