37

I am parsing an HTML document using the lxml library (http://lxml.de/). So far I have figured out how to strip tags from an HTML document (see "In lxml, how do I remove a tag but retain all contents?"), but the method described in that post leaves all the text: it strips the tags without removing the actual script. I have also found the class reference for lxml.html.clean.Cleaner (http://lxml.de/api/lxml.html.clean.Cleaner-class.html), but it is clear as mud how to actually use the class to clean the document. Any help, perhaps a short example, would be appreciated!

5 Answers

75

Below is an example that does what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in cases like this you want to strip out more than just the <script> tags; you also want to get rid of inline event handlers such as onclick attributes on other tags.

#!/usr/bin/env python

import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True # This is True because we want to activate the javascript filter
cleaner.style = True      # This is True because we want to activate the styles & stylesheet filter

print("WITH JAVASCRIPT & STYLES")
print(lxml.html.tostring(lxml.html.parse('http://www.google.com')))
print("WITHOUT JAVASCRIPT & STYLES")
print(lxml.html.tostring(cleaner.clean_html(lxml.html.parse('http://www.google.com'))))
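
To try the same Cleaner settings without a network request, here is a minimal sketch that cleans a small inline string instead of a live page. The sample markup and the names SAMPLE and cleaned are made up for illustration. The fallback import is an assumption about your installation: recent lxml releases distribute the cleaner as the separate lxml_html_clean package, so adjust the import to whatever you have installed.

```python
# Import location depends on the installed lxml version (assumption):
# older releases bundle the cleaner, newer ones ship it as lxml_html_clean.
try:
    from lxml.html.clean import Cleaner
except ImportError:
    from lxml_html_clean import Cleaner

# Hypothetical sample markup: a script, a stylesheet, and an inline handler.
SAMPLE = (
    '<div>'
    '<script>alert("boo")</script>'
    '<style>p { color: red; }</style>'
    '<p onclick="track()">Hello</p>'
    '</div>'
)

cleaner = Cleaner()
cleaner.javascript = True  # drop scripts and on* event-handler attributes
cleaner.style = True       # drop <style> elements

cleaned = cleaner.clean_html(SAMPLE)  # a str goes in, a str comes out
print(cleaned)
```

The printed markup should still contain the paragraph text, but no script, stylesheet, or onclick attribute.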

You can get a list of the options you can set in the lxml.html.clean.Cleaner documentation; some options are simple True/False switches (check the documentation for each default), and others take a list, like:

cleaner.kill_tags = ['a', 'h1']
cleaner.remove_tags = ['p']

Note the difference between kill and remove:

remove_tags:
  A list of tags to remove. Only the tags will be removed, their content will get pulled up into the parent tag.
kill_tags:
  A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.
allow_tags:
  A list of tags to include (default include all).
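
The difference between kill_tags and remove_tags is easiest to see on a small fragment. This is only a sketch: the FRAGMENT markup is invented for the example, and the fallback import assumes that newer lxml releases provide the cleaner through the separate lxml_html_clean package.

```python
# Import location depends on the installed lxml version (assumption).
try:
    from lxml.html.clean import Cleaner
except ImportError:
    from lxml_html_clean import Cleaner

# Hypothetical fragment used only for this demonstration.
FRAGMENT = (
    '<div><h1>Title</h1>'
    '<p>Keep this text, <a href="/x">drop this link</a></p></div>'
)

cleaner = Cleaner()
cleaner.kill_tags = ['a', 'h1']  # the element and everything inside it go away
cleaner.remove_tags = ['p']      # only the <p> wrapper goes; its text is pulled up

result = cleaner.clean_html(FRAGMENT)
print(result)
```

The text inside the killed <a> and <h1> elements disappears, while the text of the removed <p> ends up directly inside the <div>.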

5 Comments

I've been out most of the day; I should have brought this up earlier. After playing with this, I noticed that kill_tags doesn't seem to do anything. For example, I added cleaner.kill_tags = ('img','noscript','a'), but those tags remain in the output document. The rest of the example above works as expected.
Notice in my example I use square brackets, not parentheses. You should try ['img','noscript','a']. The square brackets denote a list, whereas the parentheses denote a tuple (in your example a 3-element tuple). Tuples and lists are not the same at all.
I tried both list and tuple notations; the effect is the same: the tags are not removed. After some further research, I believe this is a bug in the version of lxml/html/clean.py distributed with Ubuntu. Note that at line 253 of lxml.de/api/lxml.html.clean-pysrc.html, kill_tags is initialized with kill_tags = set(self.kill_tags or ()), whereas in the version of clean.py shipped with Ubuntu it is just initialized with kill_tags = set(), rendering it ineffectual. Thanks, I will notify the package maintainer.
It is not working for URLs like: blog.cryptographyengineering.com/2016/03/…
cleaner really is the thing!
7

Here are some examples of how to remove and parse different types of HTML elements from an XML/HTML tree.

KEY SUGGESTION: It's helpful to NOT depend on external libraries and do everything within "native python 2/3 code".

Here are some examples of how to do this with "native" python...

import re

# `text` holds the HTML source as a string.

# (REMOVE <SCRIPT> to </script> and variations)
pattern = r'<[ ]*script.*?/[ ]*script[ ]*>'  # non-greedy: match as little as possible up to the closing tag
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <STYLE> to </style> and variations)
pattern = r'<[ ]*style.*?/[ ]*style[ ]*>'  # non-greedy: match as little as possible up to the closing tag
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML <META ...> tags and variations)
pattern = r'<[ ]*meta.*?>'  # non-greedy: match up to the first ">"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML COMMENTS <!-- to --> and variations)
pattern = r'<[ ]*!--.*?--[ ]*>'  # non-greedy: match up to the first "-->"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

# (REMOVE HTML DOCTYPE <!DOCTYPE html to > and variations)
pattern = r'<[ ]*![ ]*DOCTYPE.*?>'  # non-greedy: match up to the first ">"
text = re.sub(pattern, '', text, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

NOTE:

re.IGNORECASE # matches case-insensitively: <script>, <SCRIPT>, or <Script>
re.MULTILINE # only changes how ^ and $ behave; these patterns use no anchors, so it has no effect here
re.DOTALL # lets "." match newlines too, so a match can span multiple lines
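
A quick way to see what each flag contributes is to run the script pattern with and without it. The sketch below uses a made-up snippet with an upper-case tag whose body spans several lines; re.MULTILINE is left out on purpose, since it only affects the ^ and $ anchors, which the pattern does not use.

```python
import re

# Hypothetical sample: an upper-case script element spanning several lines.
html = (
    '<p>before</p>\n'
    '<SCRIPT type="module">\n'
    'alert(1);\n'
    '</SCRIPT>\n'
    '<p>after</p>'
)

pattern = r'<[ ]*script.*?/[ ]*script[ ]*>'

no_flags = re.sub(pattern, '', html)
ignorecase_only = re.sub(pattern, '', html, flags=re.IGNORECASE)
with_dotall = re.sub(pattern, '', html, flags=re.IGNORECASE | re.DOTALL)

print(no_flags == html)         # True: the case differs, so nothing matches
print(ignorecase_only == html)  # True: "." stops at the first newline
print('alert' in with_dotall)   # False: the whole element was removed
```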

I've tested this on several different HTML files, and it is fast and works across newlines.

NOTE: It also does NOT depend on beautifulsoup or any other external downloaded library!

Hope this helps!

:)

4

You can use the strip_elements method to remove scripts, then use strip_tags method to remove other tags:

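from lxml import etree

etree.strip_elements(fragment, 'script')  # removes each <script>, its contents, and (by default) its tail text
etree.strip_tags(fragment, 'a', 'p')  # removes only these tags and keeps their text; add other tags as needed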
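
As a rough, self-contained illustration of the two calls, the sketch below parses an invented fragment and applies both. The with_tail=False argument is a choice made for this example: by default strip_elements also deletes the text that follows the removed element (its tail), which is often not what you want in HTML.

```python
from lxml import etree, html

# Hypothetical fragment: note the text that directly follows </script>.
fragment = html.fromstring(
    '<div><p>Intro <a href="/x">link text</a></p>'
    '<script>track();</script>text after the script</div>'
)

# Drop <script> together with its contents, but keep the text after it.
etree.strip_elements(fragment, 'script', with_tail=False)

# Drop only the <a> and <p> tags; their text is merged into the parent.
etree.strip_tags(fragment, 'a', 'p')

output = html.tostring(fragment, encoding='unicode')
print(output)
```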

2 Comments

For an HTML document when removing scripts you want to get rid of ALL the javascript, not just the <script> tags themselves, so Cleaner is a better general solution, though strip_elements is fine for an XML document.
Thanks... your answer is still a good solution for XML documents, so I added some text in my answer to clarify the XML vs HTML use cases.
3

You can also use the bs4 library for this purpose.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_src, "lxml")
[x.extract() for x in soup.find_all(['script', 'style'])]
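
The key point is that the tree is modified in place, so the value returned by the removal call can be ignored. Here is a minimal sketch of that, using a plain loop with decompose() instead of collecting the extracted tags in a list. The sample html_src is made up, and "html.parser" (the standard-library parser) is used only so that the sketch does not need lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document.
html_src = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Visible</p><script>track();</script></body></html>"
)

soup = BeautifulSoup(html_src, "html.parser")

for tag in soup.find_all(['script', 'style']):
    tag.decompose()  # removes the tag from the tree and discards it

remaining = soup.find_all(['script', 'style'])
visible_text = soup.get_text()
print(remaining)
print(visible_text)
```

After the loop, soup itself no longer contains any script or style elements.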

2 Comments

surely this does the opposite / what do you do with this list?
No, because it changes soup in the process, i.e. soup no longer has these tags.
0

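You can do this easily with regular expressions.

For JavaScript

import re

def remove_script_code(data):
    # Match <script ...> through </script>, in any letter case and across newlines
    clean = re.compile(r'<script\b.*?</script\s*>', re.IGNORECASE | re.DOTALL)
    return clean.sub('', data)

For CSS Style

def remove_style_code(data):
    # Match <style ...> through </style>, in any letter case and across newlines
    clean = re.compile(r'<style\b.*?</style\s*>', re.IGNORECASE | re.DOTALL)
    return clean.sub('', data)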
