How can I extract all text from an HTML page, excluding any CSS and JavaScript?

I am trying the following code:

import requests
from lxml import html

r = requests.get(website)
tree = html.fromstring(r.text)
html_text = tree.xpath('//text()')

But it also retrieves all the CSS and JavaScript content from the page.

  • So you want to exclude everything in <script> and <style> tags? Commented Oct 17, 2019 at 14:06
  • @mzjn Yes, that is right. I want to exclude everything in <script> and <style> and extract only the readable text from the HTML. Commented Oct 17, 2019 at 18:26
  • How do you define "readable text" in terms that can be translated into a program in your case? Everything that is not in <script> or <style>? Commented Oct 17, 2019 at 23:30
  • All text that is not in <script> or <style> tags. Commented Oct 18, 2019 at 6:39

1 Answer

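You can use the drop_tree() method of lxml.html elements to remove the elements that you are not interested in before extracting the text. Unlike removing an element through its parent, drop_tree() keeps the element's tail text, so text that follows a <script> or <style> tag is not lost.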

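from lxml import html

tree = html.fromstring(r.text)

# Collect every <script> and <style> element in the document
unwanted = tree.xpath('//script|//style')
for u in unwanted:
    # Remove the element and its children, keeping its tail text
    u.drop_tree()

# Only text outside the dropped elements remains
html_text = tree.xpath('//text()')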
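
If you would rather not modify the tree, the filtering can also be done in the XPath expression itself. The following is a minimal sketch, not the only way to do it: the SAMPLE markup and the visible_text() helper are made up for illustration, and an inline string is used instead of a live request so the snippet runs on its own.

```python
from lxml import html

# Hypothetical sample page, used here in place of requests.get(website).text
SAMPLE = """
<html>
  <head>
    <title>Demo</title>
    <style>body { color: red; }</style>
    <script>var x = 1;</script>
  </head>
  <body>
    <p>Hello <b>world</b></p>
    <script>console.log("hidden");</script>
    <p>Second paragraph</p>
  </body>
</html>
"""


def visible_text(markup):
    """Return the non-empty text chunks found outside <script> and <style>."""
    tree = html.fromstring(markup)
    # Select only text nodes that have no <script> or <style> ancestor
    nodes = tree.xpath(
        '//text()[not(ancestor::script) and not(ancestor::style)]'
    )
    # Strip surrounding whitespace and drop chunks that were only whitespace
    return [t.strip() for t in nodes if t.strip()]


chunks = visible_text(SAMPLE)
print(" ".join(chunks))
```

This leaves the parsed tree untouched, which may be convenient if you still need the <script> or <style> elements later. Note that text in other places, such as <title>, is still included; extend the predicate if you want to exclude more tags.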