How can you completely remove HTML tags containing a class in Python?

Question

I have a web scraper that uses BeautifulSoup to pull articles from CNN, FOX, and BBC. After some preprocessing, I return the raw articles to an API. However, I cannot figure out how to completely remove HTML tags that have a particular unwanted class in Python. I tried the lxml Cleaner, and I can remove tags with it, but I cannot restrict the removal to only the tags that have a certain class.

For example, if I am trying to remove the class "help", I would like a script that turns HTML like this:

<p class="help">Here are some tips which are useful</p>
<p> Welcome to webscraping 101 </p>
<p class="help">These are the tips </p>

into this:

<p> Welcome to webscraping 101 </p>

MendelG · Accepted Answer · 2022-03-08 21:50:17Z

To remove every <p> tag that has the help class, you can use BeautifulSoup's .decompose() method. According to the documentation, it:

removes a tag from the tree, then completely destroys it and its contents

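from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html: the markup from the question
# Remove every <p> whose class list includes "help", contents included
for tag in soup.find_all("p", class_="help"):
    tag.decompose()

print(soup.prettify())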
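
The loop above only targets <p> tags. If the unwanted class can also appear on other elements, one possible generalization is to drop the tag name from find_all() so that any element with the class is matched. The following is a minimal sketch under that assumption, not part of the original answer: the helper name strip_class and the sample string are hypothetical, and the third sample line uses a <div> with two classes purely to illustrate the matching.

```python
from bs4 import BeautifulSoup


def strip_class(html: str, class_name: str) -> str:
    """Return html with every element that has class_name removed."""
    soup = BeautifulSoup(html, "html.parser")
    # With no tag name given, find_all(class_=...) matches any element whose
    # class list contains class_name, including multi-class elements.
    for tag in soup.find_all(class_=class_name):
        tag.decompose()
    return str(soup)


# Hypothetical sample based on the markup in the question
sample = """<p class="help">Here are some tips which are useful</p>
<p> Welcome to webscraping 101 </p>
<div class="help note">These are the tips</div>"""

cleaned = strip_class(sample, "help")
print(cleaned.strip())  # prints: <p> Welcome to webscraping 101 </p>
```

Returning str(soup) instead of soup.prettify() keeps the remaining markup close to how it appeared in the input, which may be preferable when the cleaned HTML is passed on to an API.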
