strip html tags from an xpath @attribute

Question

I'm trying to extract some text from a webpage using lxml and XPath. There are two pieces I need.

the main text body:

import requests
import lxml.html
page = requests.get(url)
pageopen = lxml.html.fromstring(page.content)  # parse the response body, not the Response object

body_one = pageopen.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')

which is working fine

The second body of text (which is only revealed after a mouse click) I have managed to get using

pageopen.xpath('/html/body//div/div/div//div//span/@data-description')

but the text returned still has HTML junk in it.

Appending /text() to the above expression returns an empty list.

I've spent hours reading the lxml documentation but it's all Greek to me.

How do I strip html tags from an xpath @attribute?

First off, check whether you can reduce the first XPath to //p[@class="body"]/text(). You're making it too complicated. For the second one, confirm that the page isn't using JavaScript to set the value of the attribute. If it isn't, just strip the tags using fromstring as well.
– WGS, Commented Jun 9, 2014 at 5:49

Joe · Accepted Answer · 2014-06-09 05:48:58Z

but the text returned still has html junk in it

If you mean that the string is HTML, use the technique you already understand for extracting text from HTML:

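# xpath() returns a list, so take the first matching attribute value
descriptionHtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')[0]
# Parse the attribute's HTML string as its own fragment
descriptionBody = lxml.html.fromstring(descriptionHtml)
descriptionText = descriptionBody.xpath('text()')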
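
As a rough illustration of the same idea, here is a minimal, self-contained sketch. The page fragment, its data-description content, and the variable names are made up for the example and are not taken from the asker's page. It also shows one caveat: text() selects only the text nodes that are direct children of the parsed root element, so text wrapped in nested tags is skipped, whereas text_content() gathers the text of all descendants.

```python
import lxml.html

# Hypothetical page fragment: the span's data-description attribute holds escaped HTML.
page_html = (
    '<html><body><div>'
    '<span data-description="&lt;p&gt;Price: &lt;b&gt;42&lt;/b&gt; euros&lt;/p&gt;">more</span>'
    '</div></body></html>'
)
pageopen = lxml.html.fromstring(page_html)

# An attribute value is a plain string, so xpath() returns a list of strings here.
description_html = pageopen.xpath('//span/@data-description')[0]

# Re-parse that string as its own HTML fragment.
description_body = lxml.html.fromstring(description_html)

# text() only selects text nodes that are direct children of the root element.
direct_text = description_body.xpath('text()')

# text_content() concatenates the text of the element and all of its descendants.
all_text = description_body.text_content()

print(direct_text)
print(all_text)
```

If the attribute's markup nests some of the text inside child tags, text_content() is probably the safer choice of the two.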

2 Comments

reaco Over a year ago

Thanks, that worked. I just had to select the first element: descriptionHtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')[0], otherwise it threw an error.

user3180 Over a year ago

Why does this method remove numbers in the text?
