I'm trying to extract some text from a webpage using lxml and XPath. There are two bits I need.
First, the main text body:
import requests
import lxml.html
page = requests.get(url)
pageopen = lxml.html.fromstring(page.content)  # parse the response body, not the Response object
body_one = pageopen.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')
This is working fine.
The second body of text (which is only revealed after a mouse click) I have managed to get using
pageopen.xpath('/html/body//div/div/div//div//span/@data-description')
but the text returned still has HTML tags in it.
Appending /text() to the above expression returns an empty list.
I've spent hours reading the lxml documentation but it's all Greek to me.
How do I strip HTML tags from the value of an attribute selected with XPath?
For the first one, //p[@class="body"]/text() is enough. You're making it too complicated. For the second one, first confirm that the page isn't using JavaScript to fill in the value of the attribute. If it isn't, just strip the tags by parsing the attribute value with fromstring as well.
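Here is a rough sketch of that idea, assuming the attribute holds an HTML fragment that is already present in the static page source. The SAMPLE_PAGE markup and the strip_tags helper are made up for illustration; only lxml.html.fromstring, xpath and text_content come from lxml itself.

```python
import lxml.html

# Hypothetical markup standing in for the real page (an assumption, not the asker's HTML).
SAMPLE_PAGE = """
<html><body>
  <div><p class="body">Main text body.</p></div>
  <div><span data-description="&lt;p&gt;Hidden &lt;b&gt;extra&lt;/b&gt; text.&lt;/p&gt;">more</span></div>
</body></html>
"""


def strip_tags(fragment):
    """Parse an HTML fragment and return only its text (hypothetical helper)."""
    if not fragment.strip():
        # fromstring raises on empty input, so handle that case up front.
        return ""
    return lxml.html.fromstring(fragment).text_content()


doc = lxml.html.fromstring(SAMPLE_PAGE)

# The shorter expression for the main body text.
body_one = doc.xpath('//p[@class="body"]/text()')

# An attribute value is a plain string with no child text nodes,
# which is why appending /text() to @data-description yields nothing.
raw_descriptions = doc.xpath('//span/@data-description')

# Parse each attribute value as HTML in its own right and keep only the text.
descriptions = [strip_tags(raw) for raw in raw_descriptions]

print(body_one)
print(descriptions)
```

If the attribute turns out to be empty in the downloaded source, the value is probably filled in by JavaScript, and parsing the static page this way won't help.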