1

I'm trying to strip particular chunks out of HTML documents, specifically JavaScript (<script></script>) and inline CSS (<style></style>). Currently I'm using re.sub() with the re.MULTILINE flag, but it doesn't remove script blocks that span several lines. Any tips?

import re

s = '''<html>
<head>
  <title>Some Template</title>
  <script type="text/javascript" src="{path to Library}/base.js"></script>
  <script type="text/javascript" src="something.js"></script>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <script type="text/javascript">
    // HelloWorld template
    document.write(examples.simple.helloWorld());
  </script>
</body>
</html>'''

print(re.sub('<script.*script>', '', s, count=0, flags=re.M))
  • Why are you not going for BeautifulSoup? Commented Mar 2, 2016 at 6:01
  • @JasonEstibeiro My understanding of BeautifulSoup isn't extensive, but all my uses have been for parsing HTML and extracting content. I only want to clean out parts of it, and convert other parts like Bold and Italic tags to a different markup. I'm not aware of BS4 being capable of that. Commented Mar 2, 2016 at 6:04
  • Well, you can clean out parts of the HTML. Have a look at this answer. Commented Mar 2, 2016 at 6:12
  • You can even modify the HTML tree by changing its tag name and/or attributes. Have a look at the doc. Commented Mar 2, 2016 at 6:14
  • @JasonEstibeiro Huh... I'll be. Throw that up in an answer and the credit's yours. Commented Mar 2, 2016 at 6:19

2 Answers

2

Alternatively, since you are parsing and modifying HTML, I'd suggest using an HTML parser such as BeautifulSoup.

If you simply want to remove all the script tags from the HTML tree, you can use .decompose() or .extract().

.extract() will return the tag that was extracted whereas .decompose() will simply destroy it.

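from bs4 import BeautifulSoup

# s is the HTML string from the question
soup = BeautifulSoup(s, "html.parser")
# soup('script') is shorthand for soup.find_all('script')
for tag in soup('script'):
    tag.decompose()  # remove the tag and its contents from the tree

print(soup)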

As discussed in the comments, you can make further modifications to the HTML tree. Refer to the docs for more info.
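To tie this back to the question and the comments, here is a minimal sketch that removes both script and style tags and renames bold/italic tags. The sample HTML and the choice of strong/em as target tag names are assumptions made for illustration; substitute whatever markup you actually need.

```python
from bs4 import BeautifulSoup

# Sample input, made up for this sketch.
html = """<html>
<head>
  <style>p { color: red; }</style>
  <script src="simple.js"></script>
</head>
<body>
  <p>Some <b>bold</b> and <i>italic</i> text.</p>
  <script>document.write("hi");</script>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# .extract() detaches each tag and hands it back, so it can be kept if needed.
removed = [tag.extract() for tag in soup.find_all("script")]

# .decompose() detaches the tag and destroys it; nothing is returned.
for tag in soup.find_all("style"):
    tag.decompose()

# Renaming a tag is an assignment to its .name attribute.
for tag in soup.find_all("b"):
    tag.name = "strong"
for tag in soup.find_all("i"):
    tag.name = "em"

cleaned = str(soup)
print(cleaned)
```

find_all() returns a list-like result, so removing tags while looping over it is safe here.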

1

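You actually need the DOTALL modifier, not MULTILINE. re.DOTALL (inline form (?s)) lets . match newlines, whereas re.MULTILINE only changes where ^ and $ match. The quantifier also has to be lazy (.*?) so that each match stops at the nearest </script>.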

print(re.sub(r'(?s)<script\b.*?</script>', '', s))
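To see why both the flag and the lazy quantifier matter, here is a small sketch comparing three variants on a made-up snippet (the snippet and variable names are only for illustration):

```python
import re

snippet = (
    "<p>keep</p>\n"
    "<script>\n"
    "  var x = 1;\n"
    "</script>\n"
    "<p>also keep</p>\n"
    "<script>var y = 2;</script>\n"
)

# re.MULTILINE only affects ^ and $; '.' still stops at newlines,
# so the script block that spans several lines is left in place.
multiline = re.sub(r'<script.*script>', '', snippet, flags=re.M)

# re.DOTALL lets '.' cross newlines, but a greedy .* runs to the LAST
# 'script>' and swallows the paragraph between the two blocks.
greedy = re.sub(r'<script.*script>', '', snippet, flags=re.S)

# A lazy .*? stops at the first closing tag, so each block is removed separately.
lazy = re.sub(r'<script\b.*?</script>', '', snippet, flags=re.S)

print(multiline)
print(greedy)
print(lazy)
```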

To also remove the whitespace that precedes each script tag, prefix the pattern with \s*:

print(re.sub(r'(?s)\s*<script\b.*?</script>', '', s))
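Since the question also mentions inline CSS, the same idea can probably be extended to style blocks. The following is a sketch under assumptions: the sample HTML and the name STRIP_BLOCKS are made up, and a regex like this can still be fooled by unusual markup (for example a literal "</script>" inside a JavaScript string), so treat it as a quick cleanup rather than a robust parser.

```python
import re

# Sample input, made up for this sketch.
html = """<html>
<head>
  <title>Some Template</title>
  <style type="text/css">
    body { margin: 0; }
  </style>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <SCRIPT type="text/javascript">
    document.write("hello");
  </SCRIPT>
  <p>Kept</p>
</body>
</html>"""

# (script|style) captures the tag name and \1 requires the same name in the
# closing tag. DOTALL lets '.' span newlines; IGNORECASE covers <SCRIPT>.
STRIP_BLOCKS = re.compile(
    r"\s*<(script|style)\b.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)

cleaned = STRIP_BLOCKS.sub("", html)
print(cleaned)
```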
