
Does anyone have sample code that shows how to use Python's Beautiful Soup to strip all HTML tags, except for a few, from a string of text?

I want to strip all JavaScript and every HTML tag except:

<a></a>
<b></b>
<i></i>

And also things like:

<a onclick=""></a>

Thanks for helping -- I couldn't find much on the internet for this purpose.

1 Answer

# Beautiful Soup 3 API (legacy package; with bs4 the import is "from bs4 import BeautifulSoup")
import BeautifulSoup

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

for tag in soup.recursiveChildGenerator():
    if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('a','b','i'):
        print(tag)

yields

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>

If you just want the text contents, you could change print(tag) to print(tag.string).

If you want to remove an attribute like onclick="" from the a tag, you could do this:

if isinstance(tag,BeautifulSoup.Tag) and tag.name in ('a','b','i'):
    if tag.name=='a':
        del tag['onclick']
    print(tag)
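
The snippets above use the Beautiful Soup 3 API. As a rough sketch only, the same loop might look like this with bs4, assuming the bs4 package is installed and using the standard-library "html.parser" backend. The shortened doc string and the kept list are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample document (an assumption for illustration).
doc = '<p id="firstpara">This is <i>paragraph</i> <a onclick="">one</a>.</p>'
soup = BeautifulSoup(doc, "html.parser")

kept = []
# .descendants walks every node in the tree, both tags and text.
for node in soup.descendants:
    if isinstance(node, Tag) and node.name in ("a", "b", "i"):
        # In bs4, attrs is a dict; pop() does nothing if the key is absent.
        node.attrs.pop("onclick", None)
        kept.append(str(node))

print(kept)
```

Because attrs is a dict in bs4 rather than a list of pairs, clearing every attribute would be node.attrs = {} instead of an empty list.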

4 Comments

Thank you -- is there any way to remove the onclick=""?
Add 'tag.attrs=[]' before printing to remove all attributes. If you need more control, tag.attrs is just a list of (name, value) pairs that you can edit as needed.
This is probably an old version. I think "Tag" in bs4 is bs4.element.Tag.
Also, this does not preserve the text between tags (e.g., "This is" in your example). The question is ambiguous on this, but I want all text content, and some of the tags (e.g., a, b, i in this example).
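
One possible way to get the behavior this comment describes (keep all text, keep only some tags) is to unwrap every tag that is not on a whitelist. The sketch below assumes bs4 with the "html.parser" backend. The names strip_tags, ALLOWED_TAGS, ALLOWED_ATTRS and DROP_WITH_CONTENT are hypothetical, and the choice to keep only href on links is an assumption, not something the question specifies.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}            # tags to keep in the output
ALLOWED_ATTRS = {"a": {"href"}}           # assumed per-tag attribute whitelist
DROP_WITH_CONTENT = ["script", "style"]   # removed together with their text

def strip_tags(html):
    """Return html with only whitelisted tags left; other text is kept."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script/style elements entirely, including their contents.
    for tag in soup.find_all(DROP_WITH_CONTENT):
        tag.decompose()
    # find_all() returns a list, so editing the tree inside the loop is safe.
    for tag in soup.find_all(True):
        if tag.name in ALLOWED_TAGS:
            allowed = ALLOWED_ATTRS.get(tag.name, set())
            tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}
        else:
            tag.unwrap()  # drop the tag itself but keep its children
    return str(soup)

sample = ('<p id="x">This is <i>paragraph</i> '
          '<a onclick="evil()" href="/x">one</a>.<script>alert(1)</script></p>')
cleaned = strip_tags(sample)
print(cleaned)
```

Note that text inside other removed tags, such as a page title, is kept as plain text. This is only a formatting filter under the stated assumptions, not a vetted security sanitizer.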
