
I want to extract all the HTML between the <body> tags of a string or file. I've been looking at doing this in Perl with the HTML::Parser module. I thought this would be a simple task, but it's turning out to be quite tricky. I found some code that works, but I don't know how to save the results to a string. Any help is appreciated, including code showing how this can be achieved using HTML::TokeParser or similar.

Thanks

use strict;
use warnings;
use HTML::Parser;

my $content = <<EOF;
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <title>Some title goes here</title>
 </head>
 <body bgcolor="#FFFFFF">
   <div id="leftcol">
     menu column
  </div>
  <div id="body">
   <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
   </div>
    <div id="rightcol">
   news column
    </div>
 </body>
</html>
EOF


my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "self,tagname,attr" );
$p->parse($content);
exit;

sub start_handler {
    my ( $self, $tagname, $attr ) = @_;
    return unless $tagname eq 'body';

    # Once <body> has been seen, echo the source text of every later event.
    $self->handler( start => sub { print shift }, "text" );
    $self->handler( text  => sub { print shift }, "text" );
    $self->handler( end   => sub {
        my ( $endtagname, $self, $text ) = @_;
        if ( $endtagname eq $tagname ) {
            $self->eof;    # stop parsing at </body>
        }
        else {
            print $text;
        }
    }, "tagname,self,text" );
}

If I modify the start, text, and end handlers in the subroutine above as shown below, why doesn't the text get saved in my variable?

    $self->handler( start => sub { my ($text) = @_; $inner_body = $inner_body . $text; }, "text" );
    $self->handler( text  => sub { my ($text) = @_; $inner_body = $inner_body . $text; }, "text" );
    $self->handler( end   => sub {
        my ( $endtagname, $self, $text ) = @_;
        if ( $endtagname eq $tagname ) {
            $self->eof;
        }
        else {
            $inner_body = $inner_body . $text;
        }
    }, "tagname,self,text" );
}

print $inner_body; # <-- prints blank ???

Desired output, to be saved in a variable:


   <div id="leftcol">
     menu column
  </div>
  <div id="body">
   <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
   </div>
    <div id="rightcol">
   news column
    </div>
  • stackoverflow.com/questions/16615269/… Commented May 28, 2013 at 18:23
  • Re your update: How can you claim that exit(0); print $inner_body; prints blank? Commented May 29, 2013 at 22:22
  • Because it does. Why, is there something I'm missing? When I print $inner_body I get the following: [root@nurelay ~]# ./parsetest [root@nurelay ~]# Commented May 29, 2013 at 22:46

1 Answer

All you have to do is replace

print ...;

with

$inner_body .= ...;
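
For illustration, here is a minimal sketch of that substitution in a complete script. It assumes a shortened, made-up input document. Two details are assumptions about the asker's script rather than anything confirmed in the question: $inner_body is declared before the handlers are installed so the closures can append to it, and the final print runs before the script exits.

```perl
use strict;
use warnings;
use HTML::Parser;

# Shortened, hypothetical input used only for this sketch.
my $content = '<html><body><div id="a">one<br /></div><p>two</p></body></html>';

# Declared at file scope, before parsing, so every handler appends to the same variable.
my $inner_body = '';

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, 'self,tagname' );
$p->parse($content);

# Print after parsing has finished and before any exit.
print $inner_body, "\n";

sub start_handler {
    my ( $self, $tagname ) = @_;
    return unless $tagname eq 'body';

    # From <body> onward, append the original source text of each event.
    $self->handler( start => sub { $inner_body .= shift }, 'text' );
    $self->handler( text  => sub { $inner_body .= shift }, 'text' );
    $self->handler( end   => sub {
        my ( $endtagname, $parser, $text ) = @_;
        if ( $endtagname eq 'body' ) {
            $parser->eof;    # stop parsing at </body>
        }
        else {
            $inner_body .= $text;
        }
    }, 'tagname,self,text' );
}
```

Only start, text, and end events are collected here, so comments or processing instructions inside the body would be dropped unless handlers are added for them as well.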

Personally, I would use XML::LibXML instead. It can handle both HTML and XML (by using the appropriate method of the parser). What you have there is XHTML (which is XML-compatible), so we use parse_string instead of parse_html_string.

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(h => 'http://www.w3.org/1999/xhtml');

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($content);
my ($body_node) = $xpc->findnodes('/h:html/h:body', $doc)
   or die;

my $inner_body = join '', map $_->toString(), $body_node->childNodes();
print $inner_body;
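
If the input were plain HTML rather than XHTML, the other parser method mentioned above could be used in much the same way. The following is a rough sketch under that assumption, with a made-up tag-soup input. The recover and suppress_errors options ask libxml2 to continue past markup errors where it can. Elements parsed this way are not in the XHTML namespace, so the XPath expression needs no prefix.

```perl
use strict;
use warnings;
use XML::LibXML qw( );

# Hypothetical, deliberately sloppy HTML (the <p> is never closed).
my $html = '<html><body><div id="a">one<p>two</div></body></html>';

my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string(
    $html,
    { recover => 1, suppress_errors => 1 },
);

# The HTML parser does not apply namespaces, so plain element names are used.
my ($body_node) = $doc->findnodes('/html/body')
   or die "no body element found";

my $inner_body = join '', map $_->toString(), $body_node->childNodes();
print $inner_body, "\n";
```

The serialized result is the parser's repaired tree, so it may differ from the original markup, for example by having closing tags added.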

4 Comments

How does the above method handle invalid HTML?
Again, you don't have HTML, you have XHTML/XML. It doesn't do anything if it isn't valid -- it doesn't even know if it's valid or not -- but it will die if it's not well-formed. You can use $parser->recover(...) to have it recover if possible.
(Example of invalid XHTML: A DIV element inside of a P element. Example of malformed XHTML: A missing closing tag or an unescaped &)
Thanks, I'll give this a try as well. I'm also interested in getting HTML::Parser to work.
