How to extract all text from HTML, excluding CSS and JavaScript, with lxml in Python?

Question

How can I extract all text from an HTML document, excluding any CSS and JavaScript?

I am trying the following code:

import requests
from lxml import html

r = requests.get(website)
tree = html.fromstring(r.text)
html_text = tree.xpath('//text()')

But this also returns all the CSS and JavaScript content of the page.

So you want to exclude everything in <script> and <style> tags?
– mzjn, Commented Oct 17, 2019 at 14:06
@mzjn Yes, that's right. I want to exclude everything in <script> and <style> and extract only the readable text from the HTML.
– redunicorn, Commented Oct 17, 2019 at 18:26
How do you define "readable text" in terms that can be translated into a program in your case? Everything that is not in <script> or <style>?
– Valentino, Commented Oct 17, 2019 at 23:30

mzjn · Accepted Answer · 2019-10-18 09:23:29Z


You can use the drop_tree() method to remove the elements you are not interested in, together with their contents. Text that follows a removed element (its tail) is kept.

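tree = html.fromstring(r.text)

# Drop every <script> and <style> element, including its contents
unwanted = tree.xpath('//script|//style')
for u in unwanted:
    u.drop_tree()

# Only the text outside the dropped elements is left
html_text = tree.xpath('//text()')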
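
For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch of the same idea. The SAMPLE markup and the helper names visible_text and visible_text_xpath are made up for illustration; they are not part of lxml. The second helper is an alternative not mentioned above: it leaves the tree untouched and filters the text nodes with an XPath predicate instead. Both also strip whitespace-only chunks, which is an extra step you may or may not want.

```python
from lxml import html

# Hypothetical sample page, used here instead of a live requests.get() call.
SAMPLE = """
<html>
  <head>
    <title>Demo</title>
    <style>body { color: red; }</style>
    <script>var x = 1;</script>
  </head>
  <body>
    <h1>Hello</h1>
    <p>Readable <b>text</b> only.</p>
    <script>console.log("hidden");</script>
  </body>
</html>
"""


def visible_text(source):
    """Return the non-empty text chunks outside <script> and <style>."""
    tree = html.fromstring(source)
    # Remove each <script>/<style> element together with its contents.
    for element in tree.xpath('//script|//style'):
        element.drop_tree()
    # Strip whitespace and skip chunks that were only indentation.
    return [chunk.strip() for chunk in tree.xpath('//text()') if chunk.strip()]


def visible_text_xpath(source):
    """Same idea without modifying the tree, using an XPath predicate."""
    tree = html.fromstring(source)
    chunks = tree.xpath(
        '//text()[not(ancestor::script) and not(ancestor::style)]'
    )
    return [chunk.strip() for chunk in chunks if chunk.strip()]


print(visible_text(SAMPLE))
print(visible_text_xpath(SAMPLE))
```

Note that the page title still counts as text here, since it is neither in <script> nor in <style>. If you need a single string rather than a list, something like ' '.join(visible_text(SAMPLE)) should do.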
