I'm currently working with Python's HTMLParser, and I'm trying to get the HTML subtree contained in a specific node. I have a generic parser that does its job well, and once it finds the interesting tag, I would like to feed another, specific HTMLParser with the data in that node.
This is an example of what I want to do:
    class genericParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.divFound = False

        def handle_starttag(self, tag, attrs):
            if tag == "div" and ("class", "good") in attrs:
                self.divFound = True

        def handle_data(self, data):
            if self.divFound:
                print data  ## print nothing
                parser = specificParser()
                parser.feed(data)
                self.divFound = False
and feed the genericParser with something like:
    <html>
      <head></head>
      <body>
        <div class='good'>
          <ul>
            <li>test1</li>
            <li>test2</li>
          </ul>
        </div>
      </body>
    </html>
But the Python documentation for HTMLParser.handle_data says:
"This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>)."
In my genericParser, the data received in handle_data is empty, because what my <div class='good'> contains is markup rather than a text node.
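A small tracing sketch can make this visible. It assumes Python 3 (on Python 2 the import would be `from HTMLParser import HTMLParser`), and `TraceParser` is a hypothetical name used only for illustration. It logs every callback, which shows that the first `handle_data` call after the opening `<div>` carries only the newline before `<ul>`, not the nested markup.

```python
from html.parser import HTMLParser  # Python 3 import path (assumption)


class TraceParser(HTMLParser):
    """Hypothetical helper: records each parser callback in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only text between tags arrives here, including bare whitespace.
        self.events.append(("data", data))


tracer = TraceParser()
tracer.feed("<div class='good'>\n<ul>\n<li>test1</li>\n</ul>\n</div>")
tracer.close()

for event in tracer.events:
    print(event)
```

The nested `<ul>` and `<li>` elements arrive as separate start and end tag events, so a flag that is reset on the first `handle_data` call never sees them.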
How can I retrieve the raw HTML inside my div using HTMLParser?
Thanks in advance
... DOM parser to extract a subtree. Are you stuck with HTMLParser?
... handle_endtag() ending the interesting block. It's not the solution I had in mind, but I'm not stuck anymore. Thanks for your suggestion.
... BeautifulSoup and then call your specificParser. If you were stuck with HTMLParser, my idea was also to record each node until the closing </div>, but I see you have already been working on it.
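The "record each node until the closing </div>" idea from the comments could be sketched roughly as follows. This is only one possible approach, not necessarily what the asker ended up with. It assumes Python 3 (on Python 2 the import is `from HTMLParser import HTMLParser` and the `convert_charrefs` argument does not exist). `SubtreeRecorder` and `ItemParser` are hypothetical names, with `ItemParser` standing in for the question's `specificParser`. Nested `<div>` elements are counted so that the recording stops at the matching closing tag.

```python
from html.parser import HTMLParser  # Python 3 import path (assumption)


class SubtreeRecorder(HTMLParser):
    """Hypothetical sketch: rebuilds the raw inner HTML of the first <div class="good">."""

    def __init__(self):
        # convert_charrefs=False keeps entities as separate events so they
        # can be written back unchanged.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.depth = 0      # 0 means "outside the target div"
        self.chunks = []    # raw pieces of the recorded subtree
        self.done = False   # only the first matching div is recorded

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div: one more </div> to skip past
            self.chunks.append(self.get_starttag_text())
        elif not self.done and tag == "div" and ("class", "good") in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:  # this closes the target div itself
                self.done = True
                return
        self.chunks.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.chunks.append("&#%s;" % name)

    def inner_html(self):
        return "".join(self.chunks)


class ItemParser(HTMLParser):
    """Hypothetical stand-in for specificParser: collects the text of <li> items."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items.append(data)


page = ("<html><body><div class='good'><ul><li>test1</li>"
        "<li>test2</li></ul></div></body></html>")

recorder = SubtreeRecorder()
recorder.feed(page)
recorder.close()

specific = ItemParser()
specific.feed(recorder.inner_html())
specific.close()

print(recorder.inner_html())
print(specific.items)
```

End tags are rebuilt from the tag name, so the recorded text is normalized rather than byte-for-byte identical to the source (for example, `</UL>` would come back as `</ul>`). Comments and declarations inside the div are not copied in this sketch.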