
I'm trying to extract some text from a webpage using lxml and xpath - there are two bits I need

the main text body:

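import requests
import lxml.html

page = requests.get(url)
# fromstring() expects markup (str or bytes), not the Response object itself
pageopen = lxml.html.fromstring(page.content)

body_one = pageopen.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')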

which is working fine

The second body of text (which only reveals after a mouse click) I have managed to get using

pageopen.xpath('/html/body//div/div/div//div//span/@data-description')

but the text returned still has html junk in it.

Appending /text() to the expression above returns an empty list.

I've spent hours reading the lxml documentation but it's all Greek to me.

How do I strip html tags from an xpath @attribute?

  • First off, check whether you can reduce the first XPath to //p[@class="body"]/text(); it is more complicated than it needs to be. For the second one, confirm that the page isn't using JavaScript to set the value of the attribute. If it isn't, just parse the value with fromstring as well and pull the text out of that. Commented Jun 9, 2014 at 5:49
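
To illustrate the simplification suggested in the comment above, here is a minimal sketch. The page markup is hypothetical (the real page is not shown in the question), so this only demonstrates that the two expressions can select the same nodes when the nesting matches; it is not a guarantee for the actual site.

```python
import lxml.html

# Hypothetical markup standing in for the real page.
SAMPLE = """
<html><body>
  <div><div><div><div>
    <p class="body">First paragraph.</p>
    <p class="body">Second paragraph.</p>
    <p class="other">Not wanted.</p>
  </div></div></div></div>
</body></html>
"""

tree = lxml.html.fromstring(SAMPLE)

# The long expression from the question pins down the exact nesting of divs.
long_form = tree.xpath('/html/body//div/div/div//div/p[@class="body"]/text()')

# The short expression matches any <p class="body">, wherever it sits.
short_form = tree.xpath('//p[@class="body"]/text()')

print(long_form)
print(short_form)
```

The short form is less brittle if the page layout changes, but it will also pick up any other `p` elements with class "body" elsewhere on the page, so it is worth checking the result against the real document.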

1 Answer

but the text returned still has html junk in it

If you mean that the string is HTML, use the technique you already understand for extracting text from HTML:

# xpath() returns a list of matches, so take the first one before parsing it
descriptionHtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')[0]
descriptionBody = lxml.html.fromstring(descriptionHtml)
descriptionText = descriptionBody.xpath('text()')
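
As a self-contained sketch of the same idea, the snippet below parses each matched attribute value as its own HTML fragment. The helper name `attribute_texts` and the sample markup are assumptions made up for illustration; only `lxml.html.fromstring`, `xpath` and `text_content` are real lxml calls.

```python
import lxml.html


def attribute_texts(tree, attr_xpath):
    """Return the plain text of every HTML-bearing attribute matched by attr_xpath."""
    texts = []
    for raw in tree.xpath(attr_xpath):  # xpath() yields one string per matched attribute
        if not raw.strip():
            continue  # skip empty values; fromstring() raises on empty input
        fragment = lxml.html.fromstring(raw)
        texts.append(fragment.text_content().strip())
    return texts


# Hypothetical page: the attribute holds escaped HTML, as in the question.
SAMPLE = (
    '<html><body><div>'
    '<span data-description="&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;">x</span>'
    '</div></body></html>'
)

tree = lxml.html.fromstring(SAMPLE)
result = attribute_texts(tree, '//span/@data-description')
print(result)
```

Looping over the matches avoids an IndexError when nothing matches and handles pages with more than one such span.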

2 Comments

Thanks that worked. I just had to select the first element: descriptionHtml = pageopen.xpath('/html/body//div/div/div//div//span/@data-description')[0] otherwise it threw an error
Why does this method remove numbers in the text?
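
One possible explanation for the missing numbers, assuming they are wrapped in their own tags inside the attribute's HTML (the thread does not show the actual markup): a bare `text()` selects only text nodes that are direct children of the element, so anything inside a nested tag is skipped. A small sketch with a made-up fragment:

```python
import lxml.html

# Hypothetical attribute value; the number sits inside a nested <b> tag.
fragment = lxml.html.fromstring('<p>Price: <b>42</b> dollars</p>')

# Only text nodes that are direct children of <p>; the "42" is skipped.
direct_only = fragment.xpath('text()')

# Text nodes at any depth below <p>.
all_text_nodes = fragment.xpath('.//text()')

# All text under <p>, concatenated into one string.
joined = fragment.text_content()

print(direct_only)
print(all_text_nodes)
print(joined)
```

If that is the cause, switching to `.//text()` or `text_content()` should bring the numbers back.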
