
There are many ways to extract text from an HTML file, but I'd like to do the opposite: remove the text while keeping the structure and JavaScript code intact.

For example, remove all


while keeping

Is there an easy way to do this? Any help is greatly appreciated. Cheers

  • Be more specific please: provide sample HTML input, and the expected output. Commented Jul 22, 2015 at 11:39

1 Answer

I would go with BeautifulSoup:

from copy import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def strip_content(in_tag):
    """Return a copy of in_tag with all text removed; <script> tags are kept as-is."""
    tag = copy(in_tag)  # remove this line if you don't care about your input
    if tag.name == 'script':
        # Do not mess with scripts
        return tag
    # Strip content from all child tags; text nodes (NavigableString) are dropped
    children = [strip_content(child) for child in tag.children
                if not isinstance(child, NavigableString)]
    # Remove everything from the tag
    tag.clear()
    for child in children:
        # Add back the stripped children
        tag.append(child)
    return tag


def test(filename):
    # Name the parser explicitly and close the file when done
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    cleaned_soup = strip_content(soup)
    print(cleaned_soup.prettify())


if __name__ == "__main__":
    test("myfile.html")
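
If installing BeautifulSoup is not an option, a similar effect can probably be had with only the standard library's html.parser. This is not the approach above, just a rough sketch under a few assumptions: the names TextStripper and strip_text are made up for illustration, the markup is re-emitted tag by tag instead of being parsed into a tree, and comments and the doctype are dropped along with the text.

```python
from html.parser import HTMLParser


class TextStripper(HTMLParser):
    """Hypothetical helper: re-emits tags, drops text outside <script>."""

    def __init__(self):
        # convert_charrefs=False routes entities such as &amp; to their own
        # handlers, whose default implementations ignore them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as written, attributes included.
        self.out.append(self.get_starttag_text())
        if tag == "script":
            self.in_script = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep text only while inside a <script> element.
        if self.in_script:
            self.out.append(data)


def strip_text(html):
    """Return html with text nodes removed and script bodies kept."""
    parser = TextStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = '<div id="a">Hello <b>world</b><script>var x = "hi";</script></div>'
    print(strip_text(sample))
```

Unlike the BeautifulSoup version, this does not repair malformed markup; it only echoes back the tags the parser sees, so it is best suited to reasonably clean input.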

1 Comment

Great, Thank you! Exactly what I wanted.
