Get HTML subtree from HTMLParser

Question

I'm currently working with HTMLParser for Python, and I'm trying to get the HTML subtree contained in a specific node. I have a generic parser that does its job well, and once the interesting tag is found, I would like to feed another, specific HTMLParser with the data in this node.

This is an example of what I want to do:

class genericParser(HTMLParser):
   def __init__ (self):
       HTMLParser.__init__(self)
       self.divFound = False

   def handle_starttag (self, tag, attrs):
       if tag == "div" and ("class", "good") in attrs:
           self.divFound = True

   def handle_data (self, data):
       if self.divFound:
           print data    ## print nothing
           parser = specificParser ()
           parser.feed (data)
           self.divFound = False

and feed the genericParser with something like:

<html>
<head></head>
<body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body>
</html>

But the Python documentation for HTMLParser.handle_data says:

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).

In my genericParser, the data received in handle_data is empty because the content of my <div class='good'> isn't a text node.
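To make this concrete, here is a minimal Python 3 sketch that records which callbacks HTMLParser fires for markup like the sample above. EventLogger is a hypothetical name used only for illustration, not part of the code above, and the input is a shortened version of the sample.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback HTMLParser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only the text between tags arrives here, never nested markup
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<div class='good'>\n<ul><li>test1</li></ul>\n</div>")
logger.close()

for event in logger.events:
    print(event)
```

The nested <ul> and <li> elements show up as separate start and end events, so the first handle_data call after the <div> only sees the whitespace that precedes <ul>.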

How can I retrieve the raw HTML inside my div using HTMLParser?

Thanks in advance

It would be easier to use a DOM parser to extract a subtree. Are you stuck with HTMLParser?
– Birei, Commented Dec 3, 2013 at 20:59
I tried to use HTMLParser because a large part of the project is already built with it, and I ran into this problem when parsing a subtree. In the end I started recording the HTML tree in a buffer, to use it in the handle_endtag() call that ends the interesting block. It's not the solution I had in mind, but I'm not stuck anymore. Thanks for your suggestion.
– Marcassin, Commented Dec 4, 2013 at 11:00
Yes, I'm going to answer with the solution, but I'll wait a few hours to see if there is a better solution than buffering the HTML.
– Marcassin, Commented Dec 4, 2013 at 14:01
I asked because I was thinking of extracting the subtree with BeautifulSoup and then calling your specificParser. If you were stuck with HTMLParser, my idea was also to record each node until the closing </div>, but I see you have already been working on it.
– Birei, Commented Dec 4, 2013 at 14:05

Marcassin · Accepted Answer · 2013-12-04 15:18:43Z

I've solved this problem by buffering all data encountered in the interesting HTML node.

This works but isn't very "clean", because the genericParser has to parse the whole interesting block before feeding the specificParser with it. Here is a "light" solution (without any error handling):

try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class genericParser(HTMLParser):
   def __init__ (self):
       HTMLParser.__init__ (self)
       self.divFound = False
       self.buff = ""
       self.level = 0

   def computeRecord (self, tag, attrs):
       # Rebuild the start tag text from its name and attributes
       mystr = "<" + tag
       for att, val in attrs:
           mystr += " " + att + "='" + val + "'"
       mystr += ">"
       return mystr

   def handle_starttag (self, tag, attrs):
       if self.divFound:
           # Tag nested inside the interesting div: record it
           self.level += 1
           self.buff += self.computeRecord (tag, attrs)
       elif tag == "div" and ("class", "good") in attrs:
           self.divFound = True
           self.level = 1   # count the div itself, so siblings are kept

   def handle_data (self, data):
       if self.divFound:
           self.buff += data

   def handle_endtag (self, tag):
       if self.divFound:
           self.level -= 1
           if self.level == 0:
               # Closing tag of the interesting div itself
               self.divFound = False
               print(self.buff)
           else:
               self.buff += "</" + tag + ">"

The output is as desired:

<ul>
     <li>test1</li>
     <li>test2</li>
</ul>
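The same buffering idea can be sketched for Python 3, this time handing the buffered subtree to a second parser as the question intended. This is only a sketch under stated assumptions: SubtreeExtractor and ItemParser are hypothetical names (ItemParser stands in for the question's specificParser, which was never shown), and the input is a compact version of the sample HTML. It uses HTMLParser.get_starttag_text() to recover each start tag as written instead of rebuilding it from the attributes.

```python
from html.parser import HTMLParser


class SubtreeExtractor(HTMLParser):
    """Hypothetical sketch: collects the inner HTML of <div class='good'>."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # 0 means we are outside the target div
        self.buff = []
        self.subtrees = []   # one inner-HTML string per matching div

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
            # get_starttag_text() returns the start tag exactly as written
            self.buff.append(self.get_starttag_text())
        elif tag == "div" and ("class", "good") in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth
        if self.depth:
            self.buff.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Closing tag of the target div: the subtree is complete
            self.subtrees.append("".join(self.buff))
            self.buff = []
        else:
            self.buff.append("</" + tag + ">")

    def handle_data(self, data):
        if self.depth:
            self.buff.append(data)


class ItemParser(HTMLParser):
    """Hypothetical stand-in for specificParser: collects <li> texts."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items.append(data)


html = ("<body><div class='good'><ul><li>test1</li>"
        "<li>test2</li></ul></div></body>")

extractor = SubtreeExtractor()
extractor.feed(html)
extractor.close()

specific = ItemParser()
specific.feed(extractor.subtrees[0])
specific.close()

print(extractor.subtrees[0])
print(specific.items)
```

Like the solution above, this has no error handling. In particular, void elements written without a trailing slash (for example <br>) raise the depth and never lower it, and with the default convert_charrefs setting character references in text are decoded before handle_data sees them, so the buffer is not a byte-for-byte copy of the source.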

As Birei said in the comments, it would have been easier to extract the subtree with BeautifulSoup:

soup = BeautifulSoup (html)
div = soup("div", {"class" : "good"})
children = div[0].findChildren ()
print children[0]   #### desired output
