3

I'm looking for a way to remove all JavaScript <script> tags from an HTML string.

The following regex works fine, but I would like to add an exception:

$html = preg_replace('#<script[^>]*>.*?</script>#is', '', $html);

How can I add a rule so that scripts with the type text/html are ignored?

<script type="text/html" ... > ... </script> 

Any suggestions?

Thanks in advance.

3 Comments
  • Use an HTML parser instead of regex: php.net/manual/en/book.dom.php Commented Jul 7, 2011 at 23:00
  • Cool, that's what I'm doing anyway. I'm using Zend_Dom_Query for that at the moment. Do you have an idea what the XPath selector would look like? Commented Jul 7, 2011 at 23:04
  • Doesn't preg_replace allow you to specify the e flag on the regular expression so that the replacement string is treated as code? Can't you use that with a replacement expression that looks for type="text/html" and returns the whole script tag if it's there and an empty string otherwise? Commented Jul 8, 2011 at 3:03
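
The idea in the last comment can be sketched with preg_replace_callback instead of the e flag, since that modifier was deprecated in PHP 5.5 and removed in PHP 7.0. This is only a rough sketch: the function name strip_scripts_except_templates is made up for illustration, and the attribute check assumes the type attribute is written in a simple form such as type="text/html", type='text/html' or type=text/html.

```php
<?php
// Hypothetical helper (not from the thread): removes <script> blocks
// unless the opening tag carries type="text/html".
function strip_scripts_except_templates(string $html): string
{
    $result = preg_replace_callback(
        // Group 1 captures the attribute text of the opening tag.
        '#<script\b([^>]*)>.*?</script\s*>#is',
        function (array $m): string {
            // Look for a type attribute whose value is exactly text/html.
            $isTemplate = preg_match(
                '#(?:^|\s)type\s*=\s*(["\']?)text/html\1(?=\s|$)#i',
                $m[1]
            ) === 1;
            // Keep the whole block for templates, drop everything else.
            return $isTemplate ? $m[0] : '';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Like the original pattern, this only looks at <script> blocks and is not a sanitizer for untrusted HTML.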

2 Answers

3

You may not be trying to sanitize untrusted HTML, but just so readers of this question don't get the wrong idea:

This won't remove JavaScript outside <script> elements: <img src=bogus onerror=alert(42)>.

It won't remove barely obfuscated scripts: <script>alert(42)</script >.

It will turn invalid content into scripts: <scrip<script></script>t>alert(42)</script>.
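
To make the three cases concrete, here is a small sketch that runs the pattern from the question over each input. The variable names are only illustrative.

```php
<?php
// The pattern from the question, applied to the three inputs above.
$pattern = '#<script[^>]*>.*?</script>#is';

$inputs = [
    '<img src=bogus onerror=alert(42)>',           // inline event handler
    '<script>alert(42)</script >',                 // space before the closing ">"
    '<scrip<script></script>t>alert(42)</script>', // script nested in a broken tag
];

$results = [];
foreach ($inputs as $input) {
    $results[] = preg_replace($pattern, '', $input);
}

// $results[0] === '<img src=bogus onerror=alert(42)>'  (untouched)
// $results[1] === '<script>alert(42)</script >'        (untouched)
// $results[2] === '<script>alert(42)</script>'         (a script was created)
```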

I'm not saying this is what you're trying to do. You may have perfectly good reasons for doing this that don't have to do with untrusted inputs, but, for later readers, don't try to roll your own HTML sanitizer with just regular expressions.
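
For readers who want to try the parser route suggested in the comments on the question, a minimal sketch with PHP's DOM extension might look like the following. The function name remove_scripts_with_dom is hypothetical, the XPath comparison is case-sensitive (it only spares a type value written exactly as text/html), and saveHTML() returns a full document, so a fragment comes back wrapped in a doctype, html and body. It still does nothing about inline handlers such as onerror, so it is not a sanitizer either.

```php
<?php
// Hypothetical helper (not from the thread): drops every <script> element
// except those whose type attribute equals "text/html".
function remove_scripts_with_dom(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Matches scripts with no type attribute or with a different type.
    $nodes = $xpath->query('//script[not(@type = "text/html")]');

    // Copy to an array so that removal cannot disturb the iteration.
    foreach (iterator_to_array($nodes) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}
```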


1 Comment

Good comment and you are right, but to be honest I'm not too fussed about that. ;) I'm not trying to remove inline scripts. It's more about the exception.
1

Use a greedy match that isn't tripped up by the cases Mike points out, like so:

$html = preg_replace('#<script.*</script>#is', '', $html);

This should (greedily) match all script tags. As for the exception, I'm not sure how to do that, sorry.

3 Comments

If the page has script tags in both the head and near the bottom of the page, this regex will pretty much delete the entire page.
Then that's a poorly designed page.
Don't be greedy; use .*? instead of .*
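
The difference the comments describe can be seen with a small sketch. The sample page string is made up for illustration, with one script in the head and one near the end of the body.

```php
<?php
// Made-up sample page with two separate script blocks.
$page = '<head><script>a()</script></head><body><p>content</p><script>b()</script></body>';

// Greedy: .* runs from the first "<script" to the LAST "</script>".
$greedy = preg_replace('#<script.*</script>#is', '', $page);

// Lazy: .*? stops at the first "</script>" after each opening tag.
$lazy = preg_replace('#<script[^>]*>.*?</script>#is', '', $page);

// $greedy === '<head></body>'
// $lazy   === '<head></head><body><p>content</p></body>'
```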
