Using Beautiful Soup to strip HTML tags from a string

Question

Does anyone have sample code that shows how to use Python's Beautiful Soup to strip all HTML tags, except for a few, from a string of text?

I want to strip all JavaScript and all HTML tags except:

<a></a>
<b></b>
<i></i>

And also things like:

<a onclick=""></a>

Thanks for helping -- I couldn't find much on the internet for this purpose.

unutbu · Accepted Answer · 2010-12-12 21:57:57Z

import BeautifulSoup  # Beautiful Soup 3 module; the newer bs4 package is imported differently

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)

yields

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you just want the text contents, you could change print(tag) to print(tag.string).
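
For readers on the newer bs4 package, the same search might look like the sketch below. It assumes bs4 is installed and uses the standard-library "html.parser" backend; find_allowed is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''


def find_allowed(html, allowed=("a", "b", "i")):
    """Hypothetical helper: return the allowed tags in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and matches any of them.
    return soup.find_all(list(allowed))


kept = find_allowed(doc)
for tag in kept:
    # tag.string is the text of a tag that has a single text child.
    print(tag.name, tag.string)
# i paragraph
# a one
# i paragraph
# b two
```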

If you want to remove an attribute like onclick="" from the a tag, you could do this:

for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        if tag.name == 'a':
            del tag['onclick']  # drop the onclick attribute from <a> tags
        print(tag)
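
In bs4, tag attributes are stored in a dict rather than a list of pairs, so attribute removal can be sketched as below. This assumes bs4 with the "html.parser" backend; drop_attributes and its parameters are hypothetical names chosen for illustration.

```python
from bs4 import BeautifulSoup


def drop_attributes(html, names=("onclick",), tags=("a",)):
    """Hypothetical helper: remove the named attributes from the given tags."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(list(tags)):
        for name in names:
            # attrs is a dict in bs4, so pop() with a default quietly
            # skips tags that do not carry the attribute.
            tag.attrs.pop(name, None)
    return str(soup)


print(drop_attributes('<p>See <a href="/x" onclick="evil()">one</a></p>'))
# <p>See <a href="/x">one</a></p>
```

To clear every attribute on a tag instead, assigning an empty dict (tag.attrs = {}) should have the same effect in bs4.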

edited Dec 12, 2010 at 21:57

answered Dec 12, 2010 at 21:27


4 Comments

ensnare Over a year ago

Thank you -- is there any way to remove the onclick=""?

Spacedman Over a year ago

Add 'tag.attrs = []' before printing to remove all attributes. If you need more control, tag.attrs is just a list of (name, value) pairs that you can modify as needed.

dfrankow Over a year ago

This is probably an old version. I think "Tag" in bs4 is bs4.element.Tag

dfrankow Over a year ago

Also, this does not preserve the text between tags (e.g., "This is" in your example). The question is ambiguous on this, but I want all text content, and some of the tags (e.g., a, b, i in this example).
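
One way to keep all text while keeping only some tags, sketched for bs4 under the assumption that the "html.parser" backend is acceptable: remove script and style elements together with their contents, then unwrap every tag that is not on the allow list. strip_tags, ALLOWED, and DROP_WITH_CONTENT are hypothetical names, and this is an illustration rather than a vetted HTML sanitizer.

```python
from bs4 import BeautifulSoup

ALLOWED = {"a", "b", "i"}                # tags to keep as-is
DROP_WITH_CONTENT = ["script", "style"]  # tags removed along with their contents


def strip_tags(html, allowed=ALLOWED):
    """Hypothetical helper: keep text and allowed tags, drop everything else."""
    soup = BeautifulSoup(html, "html.parser")
    # First pass: delete script/style elements entirely.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()
    # Second pass: find_all(True) returns a list of every remaining tag,
    # so modifying the tree inside the loop is safe.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()  # remove the tag but keep its children and text
    return str(soup)


html = '<div><p>This is <i>paragraph</i> <a href="/x">one</a>.</p><script>alert(1)</script></div>'
print(strip_tags(html))
# This is <i>paragraph</i> <a href="/x">one</a>.
```

Attributes on the kept tags are left untouched here; they could be filtered in the second pass by editing tag.attrs.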
