Malicious code injection: Is it safe enough to remove script tags by regex?

Question

So I've set up a page where people can submit tutorials. These tutorials are basically built with a TinyMCE editor.

Anyway, someone could abuse it and just POST their own unescaped text and insert a malicious <script>.

So my question is: would it be safe enough to remove <script> tags with a regular expression? I would run this regex on my backend, before storing the content.

I've found this expression, for example:

<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>

miken32 · Accepted Answer · 2022-10-19 00:11:58Z

3

No. An attacker may be able to use multi-byte characters to slip past your regexp, or sneak in a combination of mismatched opening and closing tags, fake closing script tags, tags quoted inside attributes, and so on. Don't attempt to parse potentially noisy or malformed HTML with a regex; use an HTML parsing engine designed to deal with such concerns. See the famous answer on parsing HTML with regex here: RegEx match open tags except XHTML self-contained tags

If you're looking for one, I swear by this PHP library: http://simplehtmldom.sourceforge.net/
It first cleans the document by converting noise to entities, then takes into account "script", "style", and "textarea" elements, where anything between the opening and closing tag is meant to be text, not HTML. It then parses the result into a DOM structure that you can work with much as you would with the DOM methods in JavaScript. It also comes with a "save" method (which returns the string), so once you're done stripping tags you have your modified, well-formed document. I have also tested the library with large data: a regexp I was using before failed on large documents because PHP memory limits were reached, whereas this library parsed the same documents without memory issues. I've tested it quite thoroughly and used it on large projects, and it has never let me down the way built-in PHP functions/classes have with malformed data.

Edit: Here's an example of how to break it:

<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>

Just because the regex is used by jQuery doesn't make it safe for the server.

Even if you used the "gi" flags, it doesn't matter:

var str="<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
str=str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi,'');
// the "g" flag doesn't help here: the dangerous tag only forms after the removal, and the search continues from the middle instead of restarting from the beginning
alert(str);

But if you run the replacement in a loop until nothing changes, rather than relying on the "g" flag alone, you'll get rid of the case I bring up.
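As a rough illustration of the loop idea, here is a minimal JavaScript sketch. The function name stripScriptsRepeatedly and the sample inputs are made up for this example. It reapplies the regex from the question until the string stops changing. That does remove the nested case above, but two other inputs pass through untouched: a closing tag written with a space before the ">" (which browsers accept as a valid end tag) and an event-handler attribute.

```javascript
// The pattern from the question, with the "gi" flags.
const SCRIPT_RE = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: keep replacing until a pass changes nothing.
function stripScriptsRepeatedly(html) {
  let previous;
  do {
    previous = html;
    html = html.replace(SCRIPT_RE, "");
  } while (html !== previous);
  return html;
}

// The nested case from the answer: one pass leaves a working <script> behind.
const nested = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
// Closing tag with a space before ">": the regex never matches it.
const spacedClose = "<script>alert('XSS!')</script >";
// No <script> element at all, so there is nothing for the regex to find.
const handler = "<img src=\"about:blank\" onerror=\"alert('XSS')\">";

console.log(JSON.stringify(nested.replace(SCRIPT_RE, "")));         // "<script>alert('XSS!')</script>"
console.log(JSON.stringify(stripScriptsRepeatedly(nested)));        // ""
console.log(stripScriptsRepeatedly(spacedClose) === spacedClose);   // true
console.log(stripScriptsRepeatedly(handler) === handler);           // true
```

So the loop fixes this one bypass, not the approach.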

Edit 2: If the purpose is to sanitize user input against all JavaScript concerns, like "onload" and "onclick" attributes, why reinvent the wheel? There's http://htmlpurifier.org/ (see the demo)

edited Oct 19, 2022 at 0:11

miken32


answered Dec 4, 2015 at 22:42

Ultimater



18 Comments

Casimir et Hippolyte Over a year ago

simplehtmldom is entirely based on regex, and its pattern to find script parts is (from the source): $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is"); (similar to the one posted in the question, except that the simplehtmldom version can crash more easily). simplehtmldom is the worst HTML parser and is by far slower (and has fewer features) than the built-in DOMDocument class.

Ultimater Over a year ago

It's only OK to use regex at certain steps, not for the entire process. They apply that logic after doing $this->remove_noise("'<!--(.*?)-->'is"); since comments can also affect the regexp he links to. You can't take it out of context, skip steps, and think it's safe just because jQuery uses it. Also, simplehtmldom uses a better regex than jQuery in any case, since it factors in spaces in the tag. And they don't just strip the tags there and consider the document done: the "remove_noise" method stores a reference to the "stored away noise" while they continue on to "style", etc.

Ultimater Over a year ago

The difference here is not "is this regex good" but "can I use this regex a popular library is using, and take it out of context and use it for the entire process and trust it?"

Ultimater Over a year ago

DOMDocument has no functionality to clean the document that I am aware of. See stackoverflow.com/questions/2383349/… But if it does, please enlighten me.

boop Over a year ago

Which browsers actually execute your example, though? I tried current IE/Edge/FF/Chrome and none of them executed anything.


Clyff · Accepted Answer · 2015-12-06 13:29:20Z

1

Instead of regex, why don't you use DOM for that?

$content = "<h1>title</h1><p> test <span>1<!-- regular comment --><script> my script</script></span><script> my script</script></p><script> my script</script> <!--[if IE]><script>alert('XSS');</script><![endif]-->";

// creates a DOMDocument from your string (without doctype, html and other extra tags), wrapping the content in a div
$dom = new DOMDocument();
$dom->loadHTML("<div>{$content}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

//Removing any comments or conditional comments
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

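// recursively removes every element whose tag name is listed in $removedTags
function verifyNodes(DOMNode $node) {
    $removedTags = ['script', 'iframe']; // the tags I want to remove

    // childNodes is a live list: copy it before removing anything,
    // otherwise a removal shifts the list and the next sibling is skipped
    foreach (iterator_to_array($node->childNodes) as $childNode)
    {
        if (in_array($childNode->nodeName, $removedTags)) {
            $childNode->parentNode->removeChild($childNode);
        } elseif ($childNode->hasChildNodes()) {
            verifyNodes($childNode);
        }
    }
}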

// calling verifyNodes
verifyNodes($dom);

// get all the content of my first div, and print it
$newContent = $dom->getElementsByTagName('div')->item(0);
foreach ($newContent->childNodes as $childNode) {
    var_dump($dom->saveHTML($childNode));
}

And just as I use nodeName to check the tag's name, we can also use nodeType if we want to remove other kinds of nodes (check the list of XML node type constants).

edited Dec 6, 2015 at 13:29

answered Dec 5, 2015 at 1:47

Clyff

4,0662 gold badges19 silver badges32 bronze badges

4 Comments

Ultimater Over a year ago

That lets through any malicious comment, like an IE conditional comment wrapping a <script>, which will execute for any IE user. Plus there are other elements to remove, like iframe, embed, and object, and attributes like onload, onerror.... It is not even enough to use a list of whitelisted elements and attributes, since an attribute as innocent as "href" can be malicious, like so: <a href="javascript:alert('XSS')" title="http://google.com">Google</a>. The list of things to consider is rather long. Better to use htmlpurifier.org
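To make the "href" point concrete, here is a small JavaScript sketch of a value check for URL attributes. Everything in it is an assumption for illustration: the helper name isSafeHref, the choice of allowed schemes, and the premise that an HTML parser has already decoded entities in the attribute value. It is not how HTML Purifier does it.

```javascript
// Hypothetical allowlist of URL schemes for links.
const ALLOWED_SCHEMES = new Set(["http", "https", "mailto"]);

// Hypothetical helper: true if the (already entity-decoded) href may be kept.
function isSafeHref(value) {
  // Browsers drop tabs and newlines inside URLs and ignore leading
  // control characters and spaces, so normalize before inspecting.
  const normalized = value
    .replace(/[\t\n\r]/g, "")
    .replace(/^[\u0000-\u0020]+/, "");

  // A scheme is a letter followed by letters, digits, "+", "." or "-", then ":".
  const match = /^([a-z][a-z0-9+.-]*):/i.exec(normalized);
  if (!match) {
    return true; // no scheme: treated as a relative URL
  }
  return ALLOWED_SCHEMES.has(match[1].toLowerCase());
}

console.log(isSafeHref("http://google.com"));         // true
console.log(isSafeHref("/tutorials/42"));             // true
console.log(isSafeHref("javascript:alert('XSS')"));   // false
console.log(isSafeHref(" JaVa\tScript:alert('XSS')")); // false
```

Allowing a short list of schemes and rejecting the rest is the same whitelist idea applied to values instead of names; trying to block "javascript:" by name would miss variants like the last one.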

Clyff Over a year ago

I think it's just a matter of knowing what you want to remove. But sure, if the OP thinks he could be missing something, it's always nice to have a library for that. Updating the answer with a generic sample that can be used to remove any tag or comment.

Ultimater Over a year ago

Yeah, but it leaves malicious attributes intact. All the onload/onerror sort of attributes obviously need to be dealt with: <img src="about:blank" onerror="alert('XSS')"> Allowing "style" on any element can run JavaScript in IE: <strong style="x: expression(open(alert('XSS')))">bold text</strong>. (still works in IE 9. Disabled in IE 10, but can be emulated with X-UA-Compatible. IE 11 removes support). The problem with using a blacklist is what if a new CSS property comes out or a new HTML attribute or element? You should be using a tag/attribute whitelist and removing everything else.
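A minimal JavaScript sketch of the whitelist idea follows. It assumes some HTML parser has already turned the input into a tree of plain objects; that node shape, the ALLOWED table, and the function name sanitizeNode are all invented for this example. URL-valued attributes such as href and src are deliberately left out of the table, because they need a check on the value as well as the name.

```javascript
// Hypothetical whitelist: tag name -> attribute names permitted on that tag.
const ALLOWED = new Map([
  ["p", new Set()],
  ["span", new Set()],
  ["strong", new Set()],
  ["em", new Set()],
  ["a", new Set(["title"])],
]);

// Assumed node shapes produced by a parser:
//   text node:    { text: "..." }
//   element node: { tag: "a", attrs: { title: "..." }, children: [...] }
// Returns a cleaned copy, or null if the node is not on the whitelist.
function sanitizeNode(node) {
  if (typeof node.text === "string") {
    return { text: node.text };
  }
  const tag = String(node.tag).toLowerCase();
  const allowedAttrs = ALLOWED.get(tag);
  if (allowedAttrs === undefined) {
    return null; // unknown element: drop it together with its subtree
  }
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    const lower = name.toLowerCase();
    if (allowedAttrs.has(lower)) {
      attrs[lower] = value; // everything not listed is silently dropped
    }
  }
  const children = (node.children || [])
    .map(sanitizeNode)
    .filter((child) => child !== null);
  return { tag, attrs, children };
}

// The two examples from the comment above, as an already-parsed tree.
const tree = {
  tag: "p",
  attrs: {},
  children: [
    { tag: "img", attrs: { src: "about:blank", onerror: "alert('XSS')" }, children: [] },
    {
      tag: "strong",
      attrs: { style: "x: expression(open(alert('XSS')))" },
      children: [{ text: "bold text" }],
    },
  ],
};

console.log(JSON.stringify(sanitizeNode(tree)));
// {"tag":"p","attrs":{},"children":[{"tag":"strong","attrs":{},"children":[{"text":"bold text"}]}]}
```

Because nothing passes unless it is listed, a new element or attribute that browsers start supporting later is rejected by default, which is the point being made against blacklists.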

Clyff Over a year ago

Yup, I never said it covered all possibilities. Plus (Ctrl+C, Ctrl+V): "I think it's just a matter of knowing what you want to remove. But sure, if the OP thinks he could be missing something, it's always nice to have a library for that."

user557597 · Accepted Answer · 2015-12-04 23:45:04Z

If you can use an engine that supports atomic groups, this will probably
work. It comes closest to how a browser would parse script
tags.

Find:
(?><script(?:(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)|/)>)(?<=/>)|(?><script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?>)(?<!/>)[\S\s]*?</script\s*>

Replace: empty string

Formatted:

    # If script tags can be <script .... />
    (?>
         <
         script 
         (?:
              (?:
                   \s+ 
                   (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
              )
           |  / 
         )
         > 
    )
    (?<= /> )
 |  
    # Or, if script tags with content can be <script .... > ... </script>
    (?>
         <
         script 
         (?:
              \s+ 
              (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
         )?
         > 
    )
    (?<! /> )
    [\S\s]*? 
    </script \s* >
