So I've set up a page where people can submit tutorials. The tutorials are basically written in a TinyMCE editor.

Anyway, someone could abuse it by POSTing their own unescaped text and inserting a malicious <script>.

So my question is: would it be safe enough to remove <script> tags with a regular expression? I would run this regex on my backend, before storing the content.

I've found this expression, for example:

<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>

3 Answers


No. An attacker may be able to use multi-byte characters to bypass your regexp, or sneakily combine mismatched opening and closing tags, create fake closing script tags, quote them in attributes, and so on. Don't attempt to parse potentially noisy or malformed HTML with a regex; use an HTML parsing engine designed to deal with such concerns. See the famous answer on parsing HTML with regex: RegEx match open tags except XHTML self-contained tags
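To make that concrete, here is a minimal sketch of two simpler inputs that the pattern from the question does not match when used exactly as posted (no flags). The variable names and inputs are my own, chosen only for illustration. Browsers treat tag names case-insensitively and tolerate whitespace before the ">" of a closing tag, so both would still be parsed as script elements.

```javascript
// The pattern from the question, exactly as posted (no flags).
const questionPattern = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/;

// Hypothetical inputs, chosen only to illustrate the point.
const plain = "<script>alert('XSS!')</script>";
const upperCase = "<SCRIPT>alert('XSS!')</SCRIPT>";
const spacedClose = "<script>alert('XSS!')</script >";

console.log(questionPattern.test(plain));       // true: the straightforward case is caught
console.log(questionPattern.test(upperCase));   // false: the pattern is case-sensitive
console.log(questionPattern.test(spacedClose)); // false: "</script >" is not matched
```

The "i" flag would cover the first case, but the second one needs a change to the pattern itself, which is the kind of patching a real parser saves you from.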

If you're looking for one, I swear by this PHP library: http://simplehtmldom.sourceforge.net/
It first cleans the document by converting noise to entities, before taking into account "script", "style", and "textarea" elements, where anything found between the opening and closing tag is meant to be text, not HTML. Then it parses the result into a DOM structure that you can traverse in much the same way as you would a document with the DOM methods in JavaScript. It comes with a "save" method as well (which returns the string), so after you're done stripping tags from the page, you'll have your modified, well-formed document. I have also tested the library with large data: a regexp I was using before failed on large documents because PHP memory limits were reached, while this library parsed the same documents without memory issues. So I've tested it quite thoroughly and used it on large projects, and it has never let me down the way built-in PHP functions/classes have with malformed data.

Edit: Here's an example of how to break it:

<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>

Just because jQuery uses the regex doesn't make it safe on the server.

Even the "gi" flags don't help:

var str = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
str = str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
// The "g" flag doesn't help here: after removing the inner match, the search continues
// from that point instead of rescanning the newly joined string from the beginning
alert(str); // shows <script>alert('XSS!')</script>

But if you run the replacement in a loop until nothing changes, rather than relying on the "g" flag alone, you'll get rid of this particular case.
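As a rough sketch of that loop (the function name stripScriptsRepeatedly is mine, not from any library), the replacement can be repeated until the string stops changing. This only handles the nested-tag trick shown above. Anything that doesn't use a script tag at all, such as an onerror attribute, passes through untouched.

```javascript
const scriptPattern = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Re-run the replacement until a pass removes nothing more.
function stripScriptsRepeatedly(input) {
  let previous;
  let current = input;
  do {
    previous = current;
    current = current.replace(scriptPattern, "");
  } while (current !== previous);
  return current;
}

const nested = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
const singlePass = nested.replace(scriptPattern, "");
const looped = stripScriptsRepeatedly(nested);

console.log(JSON.stringify(singlePass)); // "<script>alert('XSS!')</script>"
console.log(JSON.stringify(looped));     // ""

// Still not a sanitizer: no script tag is involved here, so nothing is removed.
const handler = "<img src=\"about:blank\" onerror=\"alert('XSS')\">";
console.log(stripScriptsRepeatedly(handler) === handler); // true
```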

Edit 2: If the purpose is to sanitize user input against all JavaScript concerns, such as "onload" and "onclick" attributes, why reinvent the wheel? There's http://htmlpurifier.org/ (see the demo)


18 Comments

simplehtmldom is entirely based on regex, and its pattern for finding script parts is (from the source): $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is"); (similar to the one posted in the question, except that the simplehtmldom version can crash more easily). shd is the worst HTML parser, and it is far slower (and has fewer features) than the built-in DOMDocument class.
It's only OK to use regex at certain steps, not for the entire process. They use that logic after doing $this->remove_noise("'<!--(.*?)-->'is"); since comments can also affect the regexp he links to. You can't take it out of context, skip steps, and think it's safe just because jQuery uses it. Also, simplehtmldom uses a better regex than jQuery in any case, since it factors in spaces in the tag. Nor do they strip the tags there and consider the document done: the "remove_noise" method stores a reference to the "stored away noise" while they continue considering "style", etc.
The difference here is not "is this regex good" but "can I use this regex a popular library is using, and take it out of context and use it for the entire process and trust it?"
DOMDocument has no functionality to clean the document that I am aware of. See stackoverflow.com/questions/2383349/… But if it does, please enlighten me.
Which browsers actually execute your provided example, though? I tried current IE/Edge/FF/Chrome, and none of them executed anything.

Instead of regex, why don't you use DOM for that?

$content = "<h1>title</h1><p> test <span>1<!-- regular comment --><script> my script</script></span><script> my script</script></p><script> my script</script> <!--[if IE]><script>alert('XSS');</script><![endif]-->";

// Create a DOMDocument from your string (without adding a doctype, <html> or other extra tags), wrapped in a div
$dom = new DOMDocument();
$dom->loadHTML("<div>{$content}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

//Removing any comments or conditional comments
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

// Recursively remove unwanted tags from a node's subtree
function verifyNodes(DOMNode $node) {
    $removedTags = ['script', 'iframe']; // tags I want to remove

    // Iterate over a snapshot: childNodes is a live list, so removing a node
    // while looping over it directly would skip the node's next sibling
    foreach (iterator_to_array($node->childNodes) as $childNode)
    {
        if (in_array($childNode->nodeName, $removedTags)) {
            $node->removeChild($childNode);
        } elseif ($childNode->hasChildNodes()) {
            verifyNodes($childNode);
        }
    }
}

// calling verifyNodes
verifyNodes($dom);

// get all the content of my first div, and print it
$newContent = $dom->getElementsByTagName('div')->item(0);
foreach ($newContent->childNodes as $childNode) {
    var_dump($dom->saveHTML($childNode));
}

And just as I use nodeName to check the tag's name, we can also use nodeType if we want to remove other kinds of nodes (check the list of XML node type constants).

4 Comments

Which allows any malicious comment through, like: <!--[if IE]><script>alert('XSS');</script><![endif]--> which will execute for any IE user. Plus there are other elements to remove, like iframe, embed and object, and attributes like onload, onerror... It is not even enough to use a list of whitelisted elements and attributes, since an attribute as innocent as "href" can be malicious, like so: <a href="javascript:alert(&quot;XSS&quot;)" title="http://google.com">Google</a>. The list of things to consider is rather long. Better to use htmlpurifier.org
I think it's just a matter of knowing what you want to remove. But sure, if the OP thinks he could be missing something, it's always nice to have a library for that. Updating the answer with a generic sample that can be used to remove any tag or comment.
Yeah, but it leaves malicious attributes intact. All the onload/onerror sort of attributes obviously need to be dealt with: <img src="about:blank" onerror="alert('XSS')"> Allowing "style" on any element can run JavaScript in IE: <strong style="x: expression(open(alert('XSS')))">bold text</strong>. (Still works in IE 9. Disabled in IE 10, but can be emulated with X-UA-Compatible. IE 11 removes support.) The problem with using a blacklist is: what if a new CSS property, HTML attribute or element comes out? You should be using a tag/attribute whitelist and removing everything else.
Yup, I never said it was covering all possibilities. Plus (Ctrl+C, Ctrl+V): "I think it's just a matter of knowing what you want to remove. But sure, if the OP thinks he could be missing something, it's always nice to have a library for that."
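For what the whitelist idea from the comments above could look like, here is a minimal sketch. The tag list, the scheme list, and the function names (isSafeUrl, filterAttributes) are all assumptions of mine for illustration. It also assumes a real HTML parser has already produced tag names and entity-decoded attribute values. A maintained library such as the one linked above covers far more cases than this does.

```javascript
// Hypothetical allow-list: tag name -> attributes permitted on that tag.
const ALLOWED_TAGS = new Map([
  ["a", ["href", "title"]],
  ["img", ["src", "alt"]],
  ["strong", []],
  ["p", []],
]);
const URL_ATTRIBUTES = new Set(["href", "src"]);
const SAFE_SCHEMES = new Set(["http:", "https:", "mailto:"]);

// Accept scheme-less (relative) URLs and a few known schemes; reject the rest.
function isSafeUrl(value) {
  // Drop whitespace and control characters first, since browsers
  // ignore some of them when working out the scheme.
  const cleaned = value.replace(/[\u0000-\u0020]/g, "");
  const scheme = cleaned.match(/^([a-z][a-z0-9+.-]*:)/i);
  return scheme === null || SAFE_SCHEMES.has(scheme[1].toLowerCase());
}

// Returns null when the tag itself is not allowed, otherwise an object
// containing only the attributes that are allowed for that tag.
function filterAttributes(tagName, attributes) {
  const allowed = ALLOWED_TAGS.get(tagName.toLowerCase());
  if (allowed === undefined) return null;
  const kept = {};
  for (const [name, value] of Object.entries(attributes)) {
    const lower = name.toLowerCase();
    if (!allowed.includes(lower)) continue; // e.g. onerror, style
    if (URL_ATTRIBUTES.has(lower) && !isSafeUrl(value)) continue;
    kept[lower] = value;
  }
  return kept;
}

console.log(filterAttributes("script", {}));                                           // null
console.log(filterAttributes("img", { src: "x.png", onerror: "alert('XSS')" }));       // { src: 'x.png' }
console.log(filterAttributes("a", { href: "javascript:alert(1)", title: "Google" }));  // { title: 'Google' }
```

The point of the design is that anything not explicitly listed is dropped, so a new element, attribute, or URL scheme is rejected by default instead of slipping past a blacklist.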

If you can use an engine that supports atomic groups, this will probably
work. It tries to parse script tags much the way a browser would.

Find:
(?><script(?:(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)|/)>)(?<=/>)|(?><script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?>)(?<!/>)[\S\s]*?</script\s*>

Replace: empty string


Formatted:

    # If script tags can be <script .... />
    (?>
         <
         script 
         (?:
              (?:
                   \s+ 
                   (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
              )
           |  / 
         )
         > 
    )
    (?<= /> )
 |  
    # Or, if script tags with content can be <script .... > ... </script>
    (?>
         <
         script 
         (?:
              \s+ 
              (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
         )?
         > 
    )
    (?<! /> )
    [\S\s]*? 
    </script \s* >
