1

I'd like to strip all HTML and JavaScript from a string, except for these tags:

<b></b>
<ul></ul>
<li></li>
<a></a>

Thanks.

0

3 Answers

4

Do you want a way that's fast or a way that's correct? A regex-based approach is unlikely to be correct and may open you up to XSS attacks.

You should use an HTML parser like Beautiful Soup or even the standard library's htmllib (replaced by html.parser in Python 3).

Also, <a> can carry javascript: URLs in its href, and the various on* attributes (onclick, onmouseover, and so on) hold JavaScript as well. You probably want to strip all of those out. In general, a whitelist approach is best: only keep tags, attributes, and attribute values you know are safe.
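As a rough illustration of the whitelist idea, here is a minimal sketch built on the standard library's html.parser. The names WhitelistSanitizer, sanitize, is_safe_url, and the SAFE_SCHEMES set are made up for this example, and the choice of which URL schemes count as safe is an assumption. It keeps only the four tags from the question, drops every attribute except a vetted href on <a>, and discards the contents of <script> and <style>. Treat it as a starting point, not a vetted security library.

```python
import re
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "ul", "li", "a"}            # tags from the question
SAFE_SCHEMES = {"", "http", "https", "mailto"}   # assumption; "" means a relative URL
DROP_CONTENT = {"script", "style"}               # elements whose text is discarded too
SCHEME_RE = re.compile(r"([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_safe_url(url):
    # Browsers ignore tabs and newlines inside URLs, so remove spaces and
    # control characters before looking at the scheme.
    compact = "".join(ch for ch in url if ch > " ")
    match = SCHEME_RE.match(compact)
    scheme = match.group(1).lower() if match else ""
    return scheme in SAFE_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href and is_safe_url(href):
                # Only href survives; on* handlers and all other attributes are dropped.
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so a stray "<" cannot turn back into markup.
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Disallowed tags are removed but their text is kept, so sanitize('<i>z</i>') yields 'z'. An <a> whose href fails the scheme check is kept as a bare <a>.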


2

While I agree with Laurence, there are occasions where a quick and dirty 99% approach gets the job done without creating other problems.

Here's an example that demonstrates a regex-based approach:

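import re

# group(1) is the optional closing slash, group(2) everything else inside <...>.
# re.S lets '.' span newlines, so tags broken across lines are matched too.
CLEANBODY_RE = re.compile(r'<(/?)(.+?)>', re.S)

# Tags passed through unchanged, attributes included (on* handlers survive).
ALLOWED_TAGS = ('a', 'br', 'ul', 'li', 'b', 'strong', 'em', 'i')

def _repl(match):
    # The tag name is the first whitespace-delimited token; accept <br/> and <BR>.
    parts = match.group(2).split()
    tag = parts[0].rstrip('/').lower() if parts else ''
    if tag == 'p':
        # Keep <p> and </p>, but drop any attributes.
        return '<%sp>' % match.group(1)
    elif tag in ALLOWED_TAGS:
        return match.group(0)
    # Any other tag is removed; the text between tags is left in place.
    return ''

def cleanbody(html):
    return CLEANBODY_RE.sub(_repl, html)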


0

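Replace the elements you want to keep with placeholder values, then use a regex to remove any remaining tags (<[^>]*> rather than the greedy <.*>, which would swallow everything between the first < and the last >), and finally replace the placeholders with the corresponding HTML elements.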
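The three steps could be sketched as follows. The function name strip_with_placeholders and the NUL-delimited placeholder format are made up for this example. It also assumes that attributes on kept tags should be discarded, which means <a> loses its href. Like any regex approach it is quick and dirty: text inside removed elements such as <script> stays in the output, and an attribute value containing ">" will confuse it.

```python
import re

ALLOWED = ("b", "ul", "li", "a")  # tags from the question

# An opening or closing allowed tag, with optional attributes (which are discarded).
KEEP_RE = re.compile(r"<(/?)(%s)(?:\s[^>]*)?>" % "|".join(ALLOWED), re.I)
ANY_TAG_RE = re.compile(r"<[^>]*>")
# Hypothetical placeholder format: NUL, optional slash, tag name, NUL.
PLACEHOLDER_RE = re.compile(r"\x00(/?)(\w+)\x00")


def strip_with_placeholders(html):
    # Remove NULs first so the input cannot forge a placeholder.
    html = html.replace("\x00", "")
    # Step 1: swap the tags to keep for placeholders.
    html = KEEP_RE.sub(
        lambda m: "\x00%s%s\x00" % (m.group(1), m.group(2).lower()), html
    )
    # Step 2: remove every remaining tag.
    html = ANY_TAG_RE.sub("", html)
    # Step 3: turn the placeholders back into bare tags.
    return PLACEHOLDER_RE.sub(r"<\1\2>", html)
```

Because the kept tags are rebuilt from the placeholder, not copied from the input, on* handlers and javascript: hrefs cannot survive. The cost is that links lose their targets.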

1 Comment

I suggest using BBCode for the placeholders, which gives you the nice side effect of supporting BBCode without any extra computation.
