Extracting HTML between tags using Perl

Question

I want to extract all the HTML between the <body> tags of a string or file. I've been looking at using Perl with the HTML::Parser module. I thought this would be a simple task, but it's turning out to be quite tricky. I found some code that works, but I don't know how to save the results to a string. Any help is appreciated, or if you can show me some code for how this can be achieved using HTML::TokeParser or similar.

Thanks

use strict;
use warnings;
use HTML::Parser;

my $content = <<EOF;
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <title>Some title goes here</title>
 </head>
 <body bgcolor="#FFFFFF">
   <div id="leftcol">
     menu column
  </div>
  <div id="body">
   <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
   </div>
    <div id="rightcol">
   news column
    </div>
 </body>
</html>
EOF


my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "self,tagname,attr" );
$p->parse($content);
exit;

sub start_handler {
    # Only three values are passed, matching the "self,tagname,attr" argspec.
    my ($self, $tagname, $attr) = @_;
    return unless $tagname eq 'body';

    # Once <body> is seen, swap the handlers so everything inside it is printed.
    $self->handler( start => sub { print shift }, "text" );
    $self->handler( text  => sub { print shift }, "text" );
    $self->handler( end   => sub {
        my ($endtagname, $self, $text) = @_;
        if ($endtagname eq $tagname) {
            $self->eof;    # closing </body>: stop parsing
        } else {
            print $text;
        }
    }, "tagname,self,text" );
}

If I modify the start, text, and end handlers in the subroutine above as shown below, why doesn't the text from those variables get saved in mine?

    $self->handler( start => sub { my ($text) = @_; $inner_body = $inner_body . $text; }, "text" );
    $self->handler( text  => sub { my ($text) = @_; $inner_body = $inner_body . $text; }, "text" );
    $self->handler( end   => sub {
        my ($endtagname, $self, $text) = @_;
        if ($endtagname eq $tagname) {
            $self->eof;
        } else {
            $inner_body = $inner_body . $text;
        }
    }, "tagname,self,text" );
}

print $inner_body; # <-- prints blank ???

Desired output to be saved in a variable:

   <div id="leftcol">
     menu column
  </div>
  <div id="body">
   <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
   </div>
    <div id="rightcol">
   news column
    </div>

Re your update: How can you claim that exit(0); print $inner_body; prints blank?
– ikegami, Commented May 29, 2013 at 22:22
Because it does. Why, is there something I'm missing? When I print $inner_body I get the following: [root@nurelay ~]# ./parsetest [root@nurelay ~]#
– user2429569, Commented May 29, 2013 at 22:46

ikegami · Accepted Answer · 2013-05-28 18:23:10Z

1

All you have to do is replace

print ...;

with

$inner_body .= ...;
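As a minimal sketch of that substitution (not the asker's exact program), the version below declares $inner_body before the handlers so the closures can append to it, and prints it only after parsing has finished and before any exit. The shortened $content sample and the $in_body flag are assumptions made for illustration; the flag stands in for the handler swapping and the early $self->eof in the question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Shortened sample document (an assumption for this sketch).
my $content = <<'EOF';
<html>
 <body bgcolor="#FFFFFF">
   <div id="body"><p>some text<br />more text</p></div>
 </body>
</html>
EOF

my $inner_body = '';   # declared before the handlers so they can append to it
my $in_body    = 0;    # true while the parser is between <body> and </body>

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tagname, $text) = @_;
        if    ($in_body)            { $inner_body .= $text }
        elsif ($tagname eq 'body')  { $in_body = 1 }
    }, 'tagname,text' ],
    end_h => [ sub {
        my ($tagname, $text) = @_;
        return unless $in_body;
        if ($tagname eq 'body') { $in_body = 0 }
        else                    { $inner_body .= $text }
    }, 'tagname,text' ],
    # Text, comments, and everything else without a specific handler.
    default_h => [ sub { $inner_body .= shift if $in_body }, 'text' ],
);

$p->parse($content);
$p->eof;               # flush any buffered text

print $inner_body;     # must run before any exit
```

The flag keeps a single set of handlers for the whole parse, at the cost of reading the rest of the document after </body>; for large inputs, stopping early as in the question may be preferable.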

Personally, I would use XML::LibXML instead. It can handle both HTML and XML (by using the appropriate method of the parser). What you have there is XHTML (which is XML-compatible), so we use parse_string instead of parse_html_string.

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(h => 'http://www.w3.org/1999/xhtml');

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($content);
my ($body_node) = $xpc->findnodes('/h:html/h:body', $doc)
   or die;

my $inner_body = join '', map $_->toString(), $body_node->childNodes();
print $inner_body;
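Since the question also mentions HTML::TokeParser, here is a hedged sketch of the same extraction with that module. It is not part of the original answer. The shortened $content sample is an assumption for illustration. The loop relies on the documented token layout: for text tokens the original text is the second element, and for the other token types it is the last element.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened sample document (an assumption for this sketch).
my $content = <<'EOF';
<html>
 <body bgcolor="#FFFFFF">
   <div id="body"><p>some text<br />more text</p></div>
 </body>
</html>
EOF

# A reference to a scalar is parsed as a literal document.
my $tp = HTML::TokeParser->new(\$content)
    or die "Cannot create parser";

# Skip ahead to the opening <body> tag.
$tp->get_tag('body')
    or die "No <body> start tag found";

my $inner_body = '';
while (my $token = $tp->get_token) {
    my $type = $token->[0];

    # Stop at the closing </body> tag.
    last if $type eq 'E' && $token->[1] eq 'body';

    # Text tokens are ["T", $text, $is_data]; every other token type
    # carries its original source text as the last element.
    $inner_body .= $type eq 'T' ? $token->[1] : $token->[-1];
}

print $inner_body;
```

Like HTML::Parser, this approach is lenient about markup that is not well-formed, which may matter if the input is not reliably XHTML.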


4 Comments

user2429569 Over a year ago

How does the above method handle invalid html?

ikegami Over a year ago

Again, you don't have HTML, you have XHTML/XML. It doesn't do anything if it isn't valid -- it doesn't even know if it's valid or not -- but it will die if it's not well-formed. You can use $parser->recover(...) to have it recover if possible.

ikegami Over a year ago

(Example of invalid XHTML: A DIV element inside of a P element. Example of malformed XHTML: A missing closing tag or an unescaped &)

user2429569 Over a year ago

Thanks, I'll give this a try as well. I'm also interested in getting HTML::Parser to work.
