0

Say I have a text file like this:

<html><head>Headline<html><head>more words
</script>even more words</script>
<html><head>Headline<html><head>more words
</script>even more words</script>

How would I get just the tags into a list like this:

<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>
1
  • Is this a continuation of your other question? If it is, you should really edit your other question rather than re-post. (Commented Dec 14, 2010 at 5:01)

2 Answers

6

I think this is what you want:

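import re

# input_file is assumed to be an already-open file object
html_string = input_file.read()
# Non-greedy: match from each '<' to the nearest '>'
matches = re.findall(r'<.*?>', html_string)
for m in matches:
    print(m)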

Hope this helps
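
Since the question asks for the tags in a list, here is a minimal sketch that wraps the same regex in a function and returns the list instead of printing it. The names `TAG_PATTERN`, `extract_tags`, and `sample` are illustrative and not from the original answer.

```python
import re

# Non-greedy pattern: a '<', then as few characters as possible, then '>'.
TAG_PATTERN = re.compile(r'<.*?>')


def extract_tags(text):
    """Return every '<...>' substring of text, in order of appearance."""
    return TAG_PATTERN.findall(text)


# First two lines of the sample file from the question.
sample = (
    "<html><head>Headline<html><head>more words\n"
    "</script>even more words</script>\n"
)
tags = extract_tags(sample)
print(tags)
```

Keep in mind that this is a heuristic rather than a real HTML parser: `.` does not match newlines by default, so a tag split across lines is skipped unless you pass `re.DOTALL`, and a `>` inside an attribute value or a comment will cut the match short.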


2 Comments

I think you mean: re.findall('<.*?>', html_string)
@JackNull: You're absolutely right. The extra double quotes were a typo and have been retroactively fixed.
4

Python has an HTMLParser module for this (in Python 3 it is html.parser).

Here is some code which does what you want:

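try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MyHTMLParser(HTMLParser):
    # Called once for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print("<%s>" % tag)

    # Called once for every closing tag, e.g. </script>
    def handle_endtag(self, tag):
        print("</%s>" % tag)

parser = MyHTMLParser()
parser.feed("""<html><head>Headline<html><head>more words
        </script>even more words</script>
        <html><head>Headline<html><head>more words
        </script>even more words</script>
        """)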

Pass your own string to parser.feed().

Output:

$ python htmlparser.py 
<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>

This discussion on SO should help: Using HTMLParser in Python efficiently
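
The parser above prints each tag as it is found. If you want the tags in a list, as the question asks, one possible variation is to append them to an attribute instead. This is a sketch written for Python 3 (`html.parser`); the class name `TagCollector` and its `tags` attribute are made up for illustration.

```python
from html.parser import HTMLParser  # in Python 2 the module is named HTMLParser


class TagCollector(HTMLParser):
    """Collect start and end tags, in document order, into self.tags."""

    def __init__(self):
        super().__init__()
        self.tags = []  # filled in as the parser walks the input

    def handle_starttag(self, tag, attrs):
        # attrs is ignored here, so attributes are dropped from the output.
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


collector = TagCollector()
collector.feed("<html><head>Headline<html><head>more words\n"
               "</script>even more words</script>\n")
collector.close()
print(collector.tags)
```

Note that HTMLParser lowercases tag names and this sketch discards attributes, so the collected strings are normalized tags rather than the exact text from the file. If you need the original text of each start tag, `get_starttag_text()` can be called inside `handle_starttag`.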
