I have a web scraper that pulls articles from CNN, FOX, and BBC using BeautifulSoup. After some preprocessing, I return the raw articles to an API. However, I cannot figure out how to completely remove HTML tags that carry a particular unwanted class in Python. I tried the lxml Cleaner, and it can remove tags, but I could not get it to remove only the tags that have a certain class.

For example, if the class I want to remove is "help", I would like a script that turns HTML like this:

<p class="help">Here are some tips which are useful</p>
<p> Welcome to webscraping 101 </p>
<p class="help">These are the tips </p>

into this:

<p> Welcome to webscraping 101 </p>

1 Answer

To remove all tags under the help class, you can use the .decompose() method:

According to the BeautifulSoup documentation, decompose() "removes a tag from the tree, then completely destroys it and its contents".

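from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html: the scraped markup string
for tag in soup.find_all("p", class_="help"):
    tag.decompose()  # drop the tag and everything inside it

print(soup.prettify())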
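
If the unwanted class can appear on tags other than <p>, the same idea can be wrapped in a small helper. The following is only a sketch: the function name remove_by_class and the sample variable are made up for illustration, and it assumes the built-in "html.parser" backend is acceptable for your pages. It relies on find_all accepting class_ without a tag name, which matches any element whose class list contains the given value.

```python
from bs4 import BeautifulSoup


def remove_by_class(html: str, class_name: str) -> str:
    """Return html with every element carrying class_name removed.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any tag whose class list contains class_name,
    # so <p class="help tip"> or <div class="help"> is removed as well.
    for tag in soup.find_all(class_=class_name):
        tag.decompose()
    return str(soup)


# Sample markup taken from the question.
sample = """<p class="help">Here are some tips which are useful</p>
<p> Welcome to webscraping 101 </p>
<p class="help">These are the tips </p>"""

cleaned = remove_by_class(sample, "help")
print(cleaned.strip())
```

The newlines that separated the removed tags stay in the document, which is why the example strips the result before printing. If you need to keep the removed tags for later use instead of destroying them, extract() can be used in place of decompose().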