Remove multiline HTML in Python

Question

I'm trying to strip out particular chunks of HTML documents, specifically JavaScript (<script></script>) and inline CSS (<style></style>). Currently I'm trying to use re.sub() but am not having any luck with the MULTILINE flag. Any tips?

import re

s = '''<html>
<head>
  <title>Some Template</title>
  <script type="text/javascript" src="{path to Library}/base.js"></script>
  <script type="text/javascript" src="something.js"></script>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <script type="text/javascript">
    // HelloWorld template
    document.write(examples.simple.helloWorld());
  </script>
</body>
</html>'''

print(re.sub('<script.*script>', '', s, count=0, flags=re.M))

@JasonEstibeiro My understanding of BeautifulSoup isn't extensive, but all my uses have been for parsing HTML and extracting content. I only want to clean out parts of it, and convert other parts like Bold and Italic tags to a different markup. I'm not aware of BS4 being capable of that.
– David Metcalfe, Commented Mar 2, 2016 at 6:04
Well, you can clean out parts of the HTML. Have a look at this answer.
– JRodDynamite, Commented Mar 2, 2016 at 6:12
You can even modify the HTML tree by changing its tag name and/or attributes. Have a look at the doc.
– JRodDynamite, Commented Mar 2, 2016 at 6:14
@JasonEstibeiro Huh... I'll be. Throw that up in an answer and the credit's yours.
– David Metcalfe, Commented Mar 2, 2016 at 6:19

JRodDynamite · Accepted Answer · 2016-03-02 08:12:47Z

2

Alternatively, since you are parsing and modifying HTML, I'd suggest using an HTML parser such as BeautifulSoup.

If you simply want to strip out all the script tags in the HTML tree, you can use .decompose() or .extract().

.extract() will return the tag that was extracted whereas .decompose() will simply destroy it.

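from bs4 import BeautifulSoup

# s is the HTML string from the question
soup = BeautifulSoup(s, "html.parser")
# Calling soup('script') is shorthand for soup.find_all('script')
for tag in soup('script'):
    tag.decompose()

print(soup)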

As discussed in the comments, you can make additional modifications to the HTML tree. Refer to the docs for more info.
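To illustrate the kind of tree modification mentioned in the comments, here is a minimal sketch that removes both script and style tags and rewrites bold and italic tags as text markers. The sample HTML and the Markdown-style markers are assumptions made for illustration; the question does not say which target markup is wanted.

```python
from bs4 import BeautifulSoup

# Sample input, made up for this sketch
html = """<html>
<head>
  <style>p { color: red; }</style>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <p>Some <b>bold</b> and <i>italic</i> text.</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of them, so script and style are removed together
for tag in soup(["script", "style"]):
    tag.decompose()

# Hypothetical target markup: Markdown-style emphasis markers
markers = {"b": "**", "i": "*"}
for tag in soup.find_all(list(markers)):
    mark = markers[tag.name]
    # replace_with() swaps the whole tag for a plain string
    tag.replace_with(f"{mark}{tag.get_text()}{mark}")

paragraph_text = soup.p.get_text()
print(paragraph_text)
```

This sketch does not handle nested emphasis such as <b><i>...</i></b>; that case would need extra care.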

edited Mar 2, 2016 at 8:12

answered Mar 2, 2016 at 6:32

JRodDynamite


Avinash Raj · Accepted Answer · 2016-03-02 06:04:47Z

1

You actually need the DOTALL modifier, not MULTILINE. MULTILINE only changes where ^ and $ match, whereas DOTALL lets . match newline characters.

print(re.sub(r'(?s)<script\b.*?</script>', '', s))

This variant also removes the whitespace that precedes each script tag.

print(re.sub(r'(?s)\s*<script\b.*?</script>', '', s))
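The difference between the two flags, and one way to extend the pattern to the style tags the question also mentions, can be sketched as follows. The sample HTML and the variable names are made up for illustration, and the combined pattern is one possible approach rather than the only one.

```python
import re

# Sample input, made up for this sketch
html = """<head>
  <style>
    body { margin: 0; }
  </style>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <script type="text/javascript">
    document.write("hi");
  </script>
  <p>kept</p>
</body>"""

# re.M only affects ^ and $; '.' still stops at newlines,
# so only the single-line script tag is removed here
with_multiline = re.sub(r'<script\b.*?</script>', '', html, flags=re.M)

# re.S lets '.' cross newlines; the backreference \1 pairs each
# opening tag name with a closing tag of the same name
pattern = re.compile(r'\s*<(script|style)\b.*?</\1>', flags=re.S | re.I)
cleaned = pattern.sub('', html)

print(cleaned)
```

Keep in mind that a regex like this can cut a block short when the text "</script>" appears inside a JavaScript string, which is one reason an HTML parser is the safer choice for arbitrary input.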

answered Mar 2, 2016 at 6:04

Avinash Raj
