Remove all javascript tags and style tags from html with python and the lxml module

Question

I am parsing an HTML document using the lxml library (http://lxml.de/). So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post leaves all the text: it strips the tags without removing the actual script. I have also found the class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is clear as mud how to actually use the class to clean a document. Any help, perhaps a short example, would be appreciated!

Martin Thoma · Accepted Answer · 2019-08-10 13:13:14Z

75

Below is an example to do what you want. For an HTML document, Cleaner is a better general solution to the problem than using strip_elements, because in cases like this you want to strip out more than just the <script> tag; you also want to get rid of things like onclick=function() attributes on other tags.

#!/usr/bin/env python

import lxml.html  # import the html subpackage explicitly instead of relying on a side effect
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # activate the JavaScript filter
cleaner.style = True       # activate the styles & stylesheet filter

url = 'http://www.google.com'

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse(url)))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse(url))))

You can get a list of the options you can set in the lxml.html.clean.Cleaner documentation. Some options are simple True/False switches (check the documentation for each option's default, since not all of them default to False), and others take a list, like:

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between kill and remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
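To make the kill/remove distinction concrete, here is a minimal sketch run against a small inline string rather than a live URL. The sample markup and variable names are invented for illustration. It also assumes the Cleaner import works in your environment; on recent lxml releases the cleaner is distributed as the separate lxml_html_clean package, which may need to be installed first.

```python
from lxml.html.clean import Cleaner

# Hypothetical sample markup, invented for this illustration.
sample = (
    '<div>'
    '<h1>Title</h1>'
    '<p>Hello <a href="http://example.com">a link</a></p>'
    '<script>alert("hi")</script>'
    '</div>'
)

cleaner = Cleaner()
cleaner.javascript = True    # drop scripts and JavaScript attributes
cleaner.style = True         # drop <style> elements and stylesheet links
cleaner.kill_tags = ['a']    # kill: the <a> element goes away together with its text
cleaner.remove_tags = ['p']  # remove: the <p> tag goes away, its text is pulled up

# clean_html() returns the same type it was given, so a string comes back here.
cleaned = cleaner.clean_html(sample)
print(cleaned)
```

In the printed result the link text should be gone entirely (killed), while the paragraph's remaining text should survive without its surrounding <p> tag (removed).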

edited Aug 10, 2019 at 13:13

Martin Thoma

answered Dec 18, 2011 at 19:37

aculich

5 Comments

john-charles Over a year ago

I've been out most of the day; I should have brought this up earlier, I guess. After playing with this, I noticed that kill_tags doesn't seem to actually do anything. For example, I added cleaner.kill_tags = ('img','noscript','a'), but those tags remain in the output document. The rest of the example above works as expected; it's only kill_tags that has no effect.

aculich Over a year ago

Notice in my example I use square brackets, not parentheses. You should try ['img','noscript','a']. The square brackets denote a list, whereas the parentheses denote a tuple (in your example a 3-element tuple). Tuples and lists are not the same at all.

john-charles Over a year ago

I tried both list and tuple notations; the effect is the same, and the tags are not removed. After some further research, I believe this is a bug in the version of lxml/html/clean.py distributed with Ubuntu. Note that at line 253 of lxml.de/api/lxml.html.clean-pysrc.html, kill_tags is initialized as kill_tags = set(self.kill_tags or ()), whereas in the version of clean.py shipped with Ubuntu it's just initialized as kill_tags = set(), rendering it ineffectual. Thanks, I will notify the package maintainer.

Ursa Major Over a year ago

it is not working for url like : blog.cryptographyengineering.com/2016/03/…

benzkji Over a year ago

cleaner really is the thing!

Asher · Accepted Answer · 2020-07-02 03:02:00Z

Here are some examples of how to remove and parse different types of HTML elements from an XML/HTML tree.

KEY SUGGESTION: It's helpful to NOT depend on external libraries and do everything within "native python 2/3 code".

Here are some examples of how to do this with "native" python...

import re

# text is assumed to hold the HTML source as a string

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # lazily match any characters up to the closing tag
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # lazily match any characters up to the closing tag
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META> to </meta> and variations)
pattern = r'<[ ]*meta.*?>'  # lazily match any characters up to the next ">"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # lazily match any characters up to the comment end
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*\![ ]*DOCTYPE.*?>'  # lazily match any characters up to the next ">"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

NOTE:

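re.IGNORECASE # makes the match case-insensitive, so <script>, <SCRIPT>, and <Script> all match
re.MULTILINE # only changes how ^ and $ behave; these patterns use neither, so it is harmless but not required
re.DOTALL # makes "." match any character including newlines, which is what lets a match span lines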
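The role of the flags can be checked with a short sketch that wraps the patterns from this answer in one function. The function name strip_markup and the sample string are hypothetical, made up for this illustration. Regular expressions are only an approximation of HTML parsing, so treat this as a sketch for simple inputs rather than a robust sanitizer.

```python
import re

# Patterns copied from the answer above; strip_markup is a hypothetical helper name.
PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',  # <script> ... </script>
    r'<[ ]*style.*?\/[ ]*style[ ]*>',    # <style> ... </style>
    r'<[ ]*meta.*?>',                    # <meta ...>
    r'<[ ]*!--.*?--[ ]*>',               # <!-- ... -->
    r'<[ ]*\![ ]*DOCTYPE.*?>',           # <!DOCTYPE ...>
]

def strip_markup(text, flags=re.IGNORECASE | re.DOTALL):
    """Apply each pattern in turn and return the reduced text."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text, flags=flags)
    return text

# Hypothetical sample with an upper-case, multi-line script block.
sample = (
    '<!DOCTYPE html>\n'
    '<p>Hi</p>\n'
    '<SCRIPT type="text/javascript">\n'
    'alert(1);\n'
    '</SCRIPT>\n'
    '<!-- note -->'
)

with_dotall = strip_markup(sample)
# Without DOTALL, "." stops at newlines, so the multi-line script block survives.
without_dotall = strip_markup(sample, flags=re.IGNORECASE | re.MULTILINE)

print(repr(with_dotall))
print(repr(without_dotall))
```

Comparing the two printed values shows that re.DOTALL, not re.MULTILINE, is what allows a match to cross line breaks.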

I've tested this out on several different HTML files, and it works "fast" and works across newlines!

NOTE: It also does NOT depend on beautifulsoup or any other external downloaded library!

Hope this helps!

:)

cenanozen · Accepted Answer · 2011-12-18 19:11:05Z

4

You can use the strip_elements method to remove scripts, then use strip_tags method to remove other tags:

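from lxml import etree

# fragment is assumed to be an already-parsed lxml element or tree
etree.strip_elements(fragment, 'script')  # removes the elements together with their content
etree.strip_tags(fragment, 'a', 'p')  # removes only the tags and keeps their text; add other tags as needed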
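A small self-contained sketch of the two calls, using an invented fragment. One detail worth knowing: strip_elements takes a with_tail keyword that defaults to True, which also drops the text that follows the removed element. The example passes with_tail=False to keep that trailing text; whether you want that depends on your document.

```python
from lxml import etree, html

# Hypothetical fragment, invented for this illustration.
fragment = html.fromstring(
    '<div><p>Keep <a href="#">this</a> text.</p>'
    '<script>alert("x")</script> Tail text.</div>'
)

# Remove <script> elements and their content, but keep the text after them.
etree.strip_elements(fragment, 'script', with_tail=False)

# Remove the <a> and <p> tags only; their text is merged into the parent.
etree.strip_tags(fragment, 'a', 'p')

result = html.tostring(fragment, encoding='unicode')
print(result)
```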

answered Dec 18, 2011 at 19:11

cenanozen

2 Comments

aculich Over a year ago

For an HTML document when removing scripts you want to get rid of ALL the javascript, not just the <script> tags themselves, so Cleaner is a better general solution, though strip_elements is fine for an XML document.

aculich Over a year ago

Thanks... your answer is still a good solution for XML documents, so I added some text in my answer to clarify the XML vs HTML use cases.

Hafiz Muhammad Shafiq · Accepted Answer · 2017-01-13 05:17:35Z

3

You can also use the bs4 library for this purpose.

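from bs4 import BeautifulSoup

soup = BeautifulSoup(html_src, "lxml")
# extract() removes each tag from soup in place; the resulting list is simply discarded
[x.extract() for x in soup.find_all(['script', 'style'])]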

answered Jan 13, 2017 at 5:17

Hafiz Muhammad Shafiq

2 Comments

Andy Hayden Over a year ago

surely this does the opposite / what do you do with this list?

havlock Over a year ago

No, because it changes soup in the process, i.e. soup no longer has these tags.

Flair · Accepted Answer · 2021-12-30 08:13:22Z

0

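You can also use regular expressions.

For JavaScript

import re

def remove_script_code(data):
    # Allow attributes on the opening tag; DOTALL lets "." span newlines.
    clean = re.compile(r'<script\b[^>]*>.*?</script\s*>', re.IGNORECASE | re.DOTALL)
    return clean.sub('', data)

For CSS styles

def remove_style_code(data):
    # Same approach for <style> blocks.
    clean = re.compile(r'<style\b[^>]*>.*?</style\s*>', re.IGNORECASE | re.DOTALL)
    return clean.sub('', data)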
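A quick way to see why attribute and newline handling matter for patterns like these is to compare a pattern that matches only the literal <script> opening tag with a more tolerant one. The names STRICT and TOLERANT and the sample string are hypothetical, chosen for this sketch; the tolerant pattern is one possible variant, not a complete HTML parser.

```python
import re

# Matches only a bare "<script>" tag, and "." does not cross newlines.
STRICT = re.compile('<script>.*?</script>')

# Allows attributes on the opening tag, any letter case, and multi-line bodies.
TOLERANT = re.compile(r'<script\b[^>]*>.*?</script\s*>', re.IGNORECASE | re.DOTALL)

# Hypothetical sample: a script tag with an attribute and a multi-line body.
sample = '<p>Hi</p><script type="text/javascript">\nalert(1);\n</script>'

strict_result = STRICT.sub('', sample)
tolerant_result = TOLERANT.sub('', sample)

print(repr(strict_result))
print(repr(tolerant_result))
```

The strict pattern leaves the sample untouched because the opening tag carries an attribute, while the tolerant pattern removes the whole block.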

edited Dec 30, 2021 at 8:13

Flair

answered Dec 29, 2021 at 12:12

sudeep kharel
