I am using Python and want to remove all HTML tags from a string except those enclosed in a particular tag. In this example, I want to remove every HTML tag that is not inside <header>...</header>, while keeping the enclosing <header> tag itself.

For example:

<h1>Morning</h1>
<header>
    <h1>Afternoon</h1>
    <h2>Evening</h2>
</header>
<h2>Night</h2>

Result:

Morning
<header>
    <h1>Afternoon</h1>
    <h2>Evening</h2>
</header>
Night

I've spent hours on this with no luck. I know that the following will remove ALL tags:

re.sub('<.*?>', '', mystring)

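And this will match everything within the header tags, including the tags themselves (the re.DOTALL flag is needed so that . also matches the newlines inside the block):

re.sub('<header>.*?</header>', '', mystring, flags=re.DOTALL)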

But how do I negate it, so that the first regex ignores what the second regex finds? Any help is greatly appreciated! Thank you! :)
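
One way to express this "negation" with regex alone is to put both patterns into a single alternation and decide per match what to keep. The following is a minimal sketch, not a general HTML solution: it assumes the strictly formatted input described here, with <header> tags that carry no attributes and are never nested. The names PATTERN and strip_tags_outside_header are made up for illustration.

```python
import re

# Two alternatives, tried left to right at each position:
#   1. a whole <header>...</header> block, captured in group 1
#   2. any other single tag, not captured
# re.DOTALL lets . match the newlines inside the header block.
PATTERN = re.compile(r'(<header>.*?</header>)|<[^>]+>', re.DOTALL)

def strip_tags_outside_header(text):
    # If group 1 matched, put the header block back unchanged;
    # otherwise the match was a lone tag, so replace it with ''.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

mystring = '''<h1>Morning</h1>
<header>
    <h1>Afternoon</h1>
    <h2>Evening</h2>
</header>
<h2>Night</h2>'''

print(strip_tags_outside_header(mystring))
```

Because the header alternative comes first, the regex engine consumes the whole block in one match, so the tags inside it are never seen by the second alternative.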

  • Do not use regex to process HTML (stackoverflow.com/questions/701166/…). Learn how to use BeautifulSoup and make your life much easier. Commented Jul 6, 2017 at 23:46
  • I'm using it to process HTML documents that are in a very specific format (every document follows the exact same format, with strict rules on what's written), so there won't be any of those wild tags within tags and so on. I need this ASAP, so I don't really have time right now, but I will definitely learn to use BeautifulSoup in the near future! Commented Jul 6, 2017 at 23:57
  • @cullan I really, really recommend BS4. All it takes is a quick pip install beautifulsoup4 followed by running the code in my answer. :) Commented Jul 6, 2017 at 23:58

1 Answer

You can do this quickly and easily using BeautifulSoup, as mentioned by Josep Valls in the comments. Here's how:

from bs4 import BeautifulSoup

soup = BeautifulSoup('''<h1>Morning</h1>
<header>
    <h1>Afternoon</h1>
    <h2>Evening</h2>
</header>
<h2>Night</h2>''', 'html.parser')

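# Look only at top-level tags; recursive=False skips anything nested.
for tag in soup.find_all(recursive=False):
    # A tag with no child tags (e.g. <h1>Morning</h1>) is replaced by its text.
    # <header> contains child tags, so it is left untouched.
    if not tag.find_all():
        tag.unwrap()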

print(soup)

This prints out:

Morning
<header>
    <h1>Afternoon</h1>
    <h2>Evening</h2>
</header>
Night

3 Comments

Hey COLDSPEED, thank you for all your help, but could you modify your answer to cover my modified question? Greatly appreciated :)
@cullan I'd love to help, but modifying a question and completely resetting the context is not a good idea and is not helpful to future readers. Could you please post a new question?
@cullan Copy your current question and then roll back your changes. Then, post a new question.
