4

After hours of trying, I'm here to ask. I want to remove all occurrences of JS event attributes and the style attribute from POSTed text. The text may or may not contain newlines.

Posted example text:

<a href="http://www.google.com" onclick="unwanted_code" style="unwanted_style" ondblclick="unwanted_code" onmouseover="unwanted_code">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com" onclick="unwanted_code" ondblclick="unwanted_code" onmouseover="unwanted_code" style="unwanted_style">yahoo</a> is another engine.

first try:

$pattern[0] = '/(<[^>]+) on.*=".*?"/iU';
$replace[0] = '$1';
$pattern[1] = '/(<[^>]+) style=".*?"/iU';
$replace[1] = '$1';
$out = preg_replace($pattern, $replace, $in);

output:

<a href="http://www.google.com">yahoo</a> is another engine.

second try:

$out = preg_replace_callback('/(<[^>]+) on.*=".*?"/iU', function($m) {return $m[1];}, $in);

output:

<a href="http://www.google.com">yahoo</a> is another engine.

output i'm trying to get is:

<a href="http://www.google.com">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com">yahoo</a> is another engine.

Can anyone help me out?

5
  • Does it need to be a regexp-based answer? Commented Feb 24, 2014 at 8:57
  • You're probably best using some form of HTML filtering (HTMLPurifier comes to mind) and set what tags and attributes are allowed. Commented Feb 24, 2014 at 8:58
  • yes regex-based please. Commented Feb 24, 2014 at 8:59
  • An answer here: Remove on* JS event attributes from HTML tags. Although it doesn't include style, that's easy to add if you really want the regex solution. Commented Feb 24, 2014 at 9:15
  • How are you going to combat people introducing <script> tags? Commented Feb 24, 2014 at 11:19

3 Answers

3

How about:

$content = '<a href="http://www.google.com" onclick="unwanted_code" style="unwanted_style" ondblclick="unwanted_code" onmouseover="unwanted_code">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com" onclick="unwanted_code" ondblclick="unwanted_code" onmouseover="unwanted_code" style="unwanted_style">yahoo</a> is another engine.';

$result = preg_replace('%(<a href="[^"]+")[^>]+(>)%m', "$1$2", $content);
echo $result,"\n";

output:

<a href="http://www.google.com">google</a> is a search engine. There are other engines too. <a href="http://www.yahoo.com">yahoo</a> is another engine.

3

Even though the question asks for a regex-based solution, I'm adding this answer anyway, because it's more robust for input validation. This particular solution only accepts certain tags and restricts the allowed attributes:

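$doc = new DOMDocument();
// $html holds the POSTed fragment; the wrapper guarantees a <body> to read back from.
$doc->loadHTML('<html><body>' . $html . '</body></html>');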

$allowedTags = ['a' => ['href']];

$body = $doc->getElementsByTagName('body')->item(0);

$elements = $body->getElementsByTagName('*');
for ($k = 0; $element = $elements->item($k); ) {
    $name = strtolower($element->nodeName);
    if (isset($allowedTags[$name])) {
        $allowedAttributes = $allowedTags[$name];
        for ($i = 0; $attribute = $element->attributes->item($i); ) {
            if (!in_array($attribute->nodeName, $allowedAttributes)) {
                $element->removeAttribute($attribute->nodeName);
                continue;
            }
            ++$i;
        }
    } else {
        $element->parentNode->removeChild($element);
        continue;
    }
    ++$k;
}

$result = '';

foreach ($body->childNodes as $childNode) {
    $result .= $doc->saveXML($childNode);
}

echo $result;
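
If the goal is only what the question asks for (dropping on* and style while keeping every other attribute), the same DOM approach can be turned into a denylist. The following is a minimal sketch under that assumption; the function name stripEventAndStyleAttributes is hypothetical, and the code does not touch <script> tags or javascript: URLs, so it is weaker than the allowlist above.

```php
<?php
// Hypothetical helper: removes on* and style attributes from an HTML fragment.
function stripEventAndStyleAttributes(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);

    foreach ($body->getElementsByTagName('*') as $element) {
        // Gather names first: the attribute map is live, so removing while
        // iterating over it could skip entries.
        $toRemove = [];
        foreach ($element->attributes as $attribute) {
            $name = strtolower($attribute->nodeName);
            if ($name === 'style' || strpos($name, 'on') === 0) {
                $toRemove[] = $attribute->nodeName;
            }
        }
        foreach ($toRemove as $name) {
            $element->removeAttribute($name);
        }
    }

    // Serialize only the children of <body>, dropping the wrapper again.
    $result = '';
    foreach ($body->childNodes as $childNode) {
        $result .= $doc->saveHTML($childNode);
    }
    return $result;
}
```

Calling stripEventAndStyleAttributes($in) on the example text from the question should leave only the href attributes on both links.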

0

Since you want to keep one attribute (href), you cannot simply delete them all. This code does what you want, but you have to list every unwanted attribute:

$out = preg_replace('#\s+(onclick|style|ondblclick|onmouseover)="[^"]*"#i', '', $in);

Maybe it can be simplified, but this just works :)
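
To avoid listing every event name, one option is to walk the attributes of each opening tag and drop those whose name is style or starts with "on". This is only a sketch, not a vetted sanitizer: the function name stripAttributesWithRegex is hypothetical, and malformed markup (for example an unterminated quote) can still defeat it.

```php
<?php
// Hypothetical helper: regex-only removal of on* and style attributes.
function stripAttributesWithRegex(string $html): string
{
    // One opening tag; quoted values may contain ">" without ending the tag.
    $tagPattern = '/<[a-z](?:[^>"\']|"[^"]*"|\'[^\']*\')*>/i';

    // One attribute: name, then an optional value (double-quoted, single-quoted, or bare).
    $attrPattern = '/\s+([^\s=>\/"\']+)(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+))?/';

    $out = preg_replace_callback($tagPattern, function ($tag) use ($attrPattern) {
        return preg_replace_callback($attrPattern, function ($attr) {
            // Drop the whole attribute when its name is style or starts with "on".
            return preg_match('/^(?:on\w+|style)$/i', $attr[1]) ? '' : $attr[0];
        }, $tag[0]);
    }, $html);

    // preg_replace_callback() returns null on a PCRE error; fail closed with an empty string.
    return $out === null ? '' : $out;
}
```

Because each attribute is consumed as a whole, text that merely looks like an event handler inside another attribute's quoted value is left alone, and spacing around the equals sign, as in onclick = "...", is handled.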

7 Comments

Yes, that works, but I was hoping to get away with one elegant regex. Furthermore, you have to include all the other events (like onchange, onmouseout, etc.). Another thing I noticed but didn't understand: if I remove the U modifier, it only removes one "on.*" attribute even if there are 10 in the text.
One more thing: my original regex removes some text between two matches that it shouldn't remove. I didn't get that either.
No no, plenty of issues here, e.g. <a onclick = "alert('lol')">1</a>; see stackoverflow.com/a/9466152/107152 for a more complete regex. ;-)
My concern is malicious code, because normal users wouldn't craft bad POST content; they just write and post.
A more simplified version: preg_replace('/(on.*?|style)=".*?"/', ' ', $in)
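
On the question of why the original pattern swallows text between matches: the U modifier inverts greediness, so under /U a plain .* becomes lazy and .*? becomes greedy. The ".*?" at the end of the pattern therefore runs to the last double quote in the whole input. A small sketch of the effect follows; the shortened input and the variable names are made up for illustration.

```php
<?php
$in = '<a href="http://www.google.com" onclick="unwanted_code" style="unwanted_style">google</a>'
    . ' and <a href="http://www.yahoo.com" onclick="unwanted_code">yahoo</a>';

// Pattern from the question: with /U, ".*?" is greedy and spans both links.
$greedyUnderU = preg_replace('/(<[^>]+) on.*=".*?"/iU', '$1', $in);

// Negated character classes cannot cross a quote, so each match stays inside one attribute.
$bounded = preg_replace('/\s+on[a-z]+="[^"]*"/i', '', $in);

echo $greedyUnderU, "\n";
echo $bounded, "\n";
```

The first call collapses the input to a single google link with the text "yahoo", which mirrors the output shown in the question. The second call removes only the two onclick attributes, although as written it would also match the same text outside of a tag.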