What's the fastest way to strip certain html tags in a Python string?

Question

I'd like to strip all HTML / JavaScript except for:

<b></b>
<ul></ul>
<li></li>
<a></a>

Thanks.

Laurence Gonsalves · Accepted Answer · 2010-12-12 00:04:57Z

4

Do you want a way that's fast or a way that's correct? A regex-based approach is unlikely to be correct and may open you up to XSS attacks.

You should use an HTML parser like Beautiful Soup or even htmllib (removed in Python 3, where the standard library's html.parser fills that role).

Also, <a> can contain javascript: hrefs and there are also the various on* attributes which are javascript. You probably want to strip all of those out. In general, a whitelist approach is best: only keep attributes (and attribute values) you know are safe.
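A minimal sketch of the whitelist idea, using the standard library's html.parser so that it runs without extra packages (the answer itself names Beautiful Soup and htmllib). The names WhitelistSanitizer, sanitize, ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES and SKIP_CONTENT are hypothetical, and the list of allowed URL schemes and the decision to drop script/style content are assumptions, not something the question specifies. Treat it as an illustration rather than a vetted sanitizer.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumptions: the tags from the question, and only href on <a>.
ALLOWED_TAGS = {"b", "ul", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = {"", "http", "https", "mailto"}  # "" means a relative URL
SKIP_CONTENT = {"script", "style"}  # dropped together with their text


def _safe_url(value):
    # Browsers ignore whitespace and control characters inside the scheme,
    # so remove them before inspecting it.
    compact = re.sub(r"[\x00-\x20]", "", value)
    try:
        return urlsplit(compact).scheme.lower() in SAFE_SCHEMES
    except ValueError:
        return False


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            kept = []
            for name, value in attrs:
                # Keep only whitelisted attributes with a safe URL value.
                if name not in ALLOWED_ATTRS.get(tag, ()) or value is None:
                    continue
                if not _safe_url(value):
                    continue
                kept.append(' %s="%s"' % (name, escape(value, quote=True)))
            self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped so it can never turn back into markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With this sketch, sanitize('<b>hi</b><script>alert(1)</script>') returns '<b>hi</b>', and an <a> tag with a javascript: href or an onclick attribute comes back as a bare <a>.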

s29 · Accepted Answer · 2011-11-11 06:01:02Z

2

While I agree with Laurence, there are occasions where a quick and dirty 99% approach gets the job done without creating other problems.

Here's an example that demonstrates a regex-based approach:

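import re

# group(1) is "/" for a closing tag; group(2) is the tag name plus any attributes.
# re.S lets "." match newlines, so tags that span several lines are caught too.
CLEANBODY_RE = re.compile(r'<(/?)(.+?)>', re.S)
ALLOWED_TAGS = ('a', 'br', 'ul', 'li', 'b', 'strong', 'em', 'i')

def _repl(match):
    # The first whitespace-separated token is the tag name; "<BR/>" becomes "br".
    parts = match.group(2).split()
    tag = parts[0].rstrip('/').lower() if parts else ''
    if tag == 'p':
        # Keep paragraph tags but drop their attributes.
        return '<%sp>' % match.group(1)
    elif tag in ALLOWED_TAGS:
        # Kept verbatim, attributes included (on* handlers and javascript: hrefs survive).
        return match.group(0)
    return ''

def cleanbody(html):
    return CLEANBODY_RE.sub(_repl, html)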

Daniel · Accepted Answer · 2010-12-11 23:28:48Z

0

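Replace the elements you want to keep with placeholder values, then use a regex to remove any remaining tags (a non-greedy <.*?> rather than <.*>, which would swallow everything between the first < and the last >), and finally replace the placeholders with the corresponding HTML elements.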
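The three steps above can be sketched roughly as follows. The function name strip_with_placeholders, the NUL-delimited placeholder format, and the tag list (taken from the question) are assumptions for illustration. Two things go beyond what the answer describes: attributes on kept tags are dropped (so links lose their href), and any stray < or > left after stripping is escaped, since an unterminated tag could otherwise swallow a restored one.

```python
import re

KEEP = ("b", "ul", "li", "a")  # tags from the question
_NAMES = "|".join(KEEP)

# An allowed opening or closing tag, with or without attributes.
KEEP_RE = re.compile(r"<(/?)(%s)(?=[\s/>])[^>]*>" % _NAMES, re.I)
ANY_TAG_RE = re.compile(r"<[^>]*>")
PLACEHOLDER_RE = re.compile(r"\x00(/?)(%s)\x00" % _NAMES)


def strip_with_placeholders(html):
    # Remove NUL characters first so the input cannot forge a placeholder.
    text = html.replace("\x00", "")
    # 1. Swap allowed tags for placeholders (attributes are discarded).
    text = KEEP_RE.sub(
        lambda m: "\x00%s%s\x00" % (m.group(1), m.group(2).lower()), text
    )
    # 2. Strip every remaining tag.
    text = ANY_TAG_RE.sub("", text)
    # Extra step (not in the answer): neutralise leftover angle brackets.
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    # 3. Turn the placeholders back into bare tags.
    return PLACEHOLDER_RE.sub(r"<\1\2>", text)
```

Note that the text inside removed elements is kept: strip_with_placeholders('<b>bold</b><script>alert(1)</script>') returns '<b>bold</b>alert(1)'.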

1 Comment

Daniel · Over a year ago

I suggest using BBCode for the placeholders, which gives you the nice side effect of supporting BBCode without any extra computation.
