I'm working with Python's HTMLParser and trying to get the HTML subtree contained in a specific node. I have a generic parser that does its job well, and once it finds the interesting tag, I would like to feed the data in that node to another, more specific HTMLParser.

This is an example of what I want to do:

class genericParser(HTMLParser):
   def __init__ (self):
       HTMLParser.__init__(self)
       self.divFound = False

   def handle_starttag (self, tag, attrs):
       if tag == "div" and ("class", "good") in attrs:
           self.divFound = True

   def handle_data (self, data):
       if self.divFound:
           print data    ## print nothing
           parser = specificParser ()
           parser.feed (data)
           self.divFound = False

and then feed the genericParser with something like:

<html>
<head></head>
<body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body>
</html>

But the Python documentation for HTMLParser.handle_data says:

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).

In my genericParser, the data received in handle_data contains no markup, because the content of my <div class='good'> isn't a text node.
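
To make this concrete, here is a minimal sketch that records every callback HTMLParser fires for a div like mine. The class name EventLogger is made up for illustration, and the import uses the Python 3 module name (on Python 2 it would be `from HTMLParser import HTMLParser`). It suggests that handle_data only ever receives the text between tags, never the nested markup.

```python
from html.parser import HTMLParser  # Python 3 module name


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the trace stays readable
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed("<div class='good'><ul><li>test1</li></ul></div>")
logger.close()
for event in logger.events:
    print(event)
```

The only "data" event in the trace is the text inside the li; the ul and li elements themselves arrive through handle_starttag and handle_endtag.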

How can I retrieve the raw inner HTML of my div using HTMLParser?

Thanks in advance

  • It would be easier to use a DOM parser to extract a subtree. Are you stuck with HTMLParser? Commented Dec 3, 2013 at 20:59
  • I tried to use HTMLParser because a big part of the project is already done with it, and I ran into this problem when parsing a subtree. In the end I started recording the HTML tree in a buffer, which I use in the handle_endtag() call that closes the interesting block. It's not the solution I had in mind, but I'm not stuck anymore. Thanks for your suggestion. Commented Dec 4, 2013 at 11:00
  • So, have you already solved it? Commented Dec 4, 2013 at 13:46
  • Yes, I'm going to answer with the solution, but I'll wait a few hours to see if there is a better solution than buffering the HTML. Commented Dec 4, 2013 at 14:01
  • I asked because I was thinking of extracting the subtree with BeautifulSoup and then calling your specificParser. If you were stuck with HTMLParser, my idea was also to record each node until the closing </div>, but I see you have already been working on it. Commented Dec 4, 2013 at 14:05

1 Answer

I've solved this problem by buffering all data encountered in the interesting HTML node.

This works but isn't very "clean", because the genericParser has to parse the whole interesting block before feeding it to the specificParser. Here is a "light" solution (without any error handling):

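from HTMLParser import HTMLParser  # Python 3: from html.parser import HTMLParser

class genericParser(HTMLParser):
   def __init__ (self):
       HTMLParser.__init__ (self)
       self.divFound = False
       self.buff = ""
       self.level = 0   # nesting depth, counting the interesting div itself

   def computeRecord (self, tag, attrs):
       # Rebuild a start tag from its name and attributes
       mystr = "<" + tag
       for att, val in attrs:
           # val is None for attributes written without a value
           mystr += " " + att + "='" + (val or "") + "'"
       return mystr + ">"

   def handle_starttag (self, tag, attrs):
       if self.divFound:
           # Inside the interesting div: record the tag and go one level deeper
           self.level += 1
           self.buff += self.computeRecord (tag, attrs)
       elif tag == "div" and ("class", "good") in attrs:
           self.divFound = True
           self.level = 1

   def handle_data (self, data):
       if self.divFound:
           self.buff += data

   def handle_endtag (self, tag):
       if self.divFound:
           self.level -= 1
           if self.level == 0:
               # Closing tag of the interesting div: the buffer is complete
               self.divFound = False
               print(self.buff)
           else:
               self.buff += "</" + tag + ">"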

The output is as desired:

<ul>
     <li>test1</li>
     <li>test2</li>
</ul>
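
For completeness, here is a rough sketch of the full round trip: buffer the inner HTML, then feed it to a second parser. It is only one way to do it, under a few assumptions. SubtreeGrabber, ItemParser and VOID_TAGS are names invented for this example (ItemParser stands in for the specificParser, whose real code isn't shown here), the import uses the Python 3 module name, and the start tags are copied with get_starttag_text(), which returns the most recent start tag as it was written, instead of being rebuilt from attrs.

```python
from html.parser import HTMLParser  # Python 3 module name

# Elements that never get a closing tag, so they must not change the depth
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SubtreeGrabber(HTMLParser):
    """Hypothetical variant of genericParser: buffers the inner HTML of
    the first <div class="good"> it meets."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0          # 0 means "outside the interesting div"
        self.parts = []         # pieces of inner HTML, joined at the end
        self.inner_html = None  # filled in once the div is closed

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Copy the start tag exactly as it appeared in the source
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif (self.inner_html is None and tag == "div"
              and ("class", "good") in attrs):
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close at once
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            # This end tag closes the interesting div itself
            self.inner_html = "".join(self.parts)
        else:
            self.parts.append("</%s>" % tag)


class ItemParser(HTMLParser):
    """Stand-in for specificParser: collects the text of each <li>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        self.in_li = (tag == "li")

    def handle_endtag(self, tag):
        self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items.append(data)


page = ("<body><div class='good'><ul><li>test1</li>"
        "<li>test2<br></li></ul></div><p>other</p></body>")
grabber = SubtreeGrabber()
grabber.feed(page)
grabber.close()

specific = ItemParser()
specific.feed(grabber.inner_html)
specific.close()
print(grabber.inner_html)
print(specific.items)
```

This still has no error handling: a mismatched closing tag throws the depth counter off, and character references such as &amp;amp; are decoded by default on Python 3, so the buffer is not guaranteed to be a byte-for-byte copy of the source.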

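As Birei said in the comments, it would have been easier to extract the subtree with BeautifulSoup:

from bs4 import BeautifulSoup   # assuming BeautifulSoup 4

soup = BeautifulSoup (html, "html.parser")   # html holds the page source
div = soup("div", {"class" : "good"})        # every <div class="good">
children = div[0].findChildren ()            # all tags nested inside the first one
print(children[0])   #### desired output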