Python has an HTMLParser class for this, in the html.parser module (the HTMLParser module in Python 2).
Here is some code which does what you want:
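from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print("<%s>" % tag)

    # Called for every closing tag, e.g. </script>
    def handle_endtag(self, tag):
        print("</%s>" % tag)

parser = MyHTMLParser()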
parser.feed("""<html><head>Headline<html><head>more words
</script>even more words</script>
<html><head>Headline<html><head>more words
</script>even more words</script>
""")
Pass your own string to parser.feed() in place of the sample text.
Output:
$ python htmlparser.py
<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>
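If you need the tags as data rather than printed output, the same two callbacks can append to a list. This is only a minimal sketch, assuming Python 3; the names TagCollector and collect_tags are made up for illustration and are not part of the standard library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical variant of MyHTMLParser: stores tags instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tags = []  # tags in the order the parser encounters them

    def handle_starttag(self, tag, attrs):
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


def collect_tags(markup):
    """Feed markup to a fresh parser and return the list of tags seen."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.tags
```

Calling collect_tags(your_string) then gives you a list you can filter or count, instead of text on stdout.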
This discussion on SO should help: Using HTMLParser in Python efficiently