Parse HTML using Perl

Question

I have the following HTML:

<div>
   <strong>Date: </strong>
       19 July 2011
</div>

I have been using HTML::TreeBuilder to extract parts of HTML that are identified by tags or classes. However, with the HTML above I am having difficulty extracting only the date.

For instance, I tried:

for ( $tree->look_down( '_tag' => 'div' ) ) {
    my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text;
}

But that seems to conflict with an earlier use of <strong>. I want to extract just the '19 July 2011'. I have read the TreeBuilder documentation but cannot find a way of doing this.

How can I do this using TreeBuilder?

Dave Cross · Accepted Answer · 2011-07-21 16:04:57Z

3

The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.

The solution here is to get the parent element of the text you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in is stored as plain text nodes, i.e. items in the list that are plain strings rather than references to HTML::Element objects.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<END_OF_HTML);
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
$tree->eof;    # signal end of input so the tree is finalized

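my $date;

# Each <div>'s content list mixes HTML::Element objects and plain strings;
# the plain strings are the text nodes.
for my $div ($tree->look_down( _tag => 'div')) {
  for ($div->content_list) {
    $date = $_ unless ref;
  }
}

# Text nodes may keep surrounding whitespace, so trim before printing.
$date =~ s/^\s+|\s+$//g if defined $date;

print "$date\n";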
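The "dump" method mentioned above can be tried on the same sample. This is only a minimal sketch: it builds the tree with new_from_content as a shortcut for new/parse/eof, and it captures the dump in a string (the variable name $dump_text is just for illustration) so it can be inspected or searched. Calling $tree->dump with no argument prints the same outline to STDOUT.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

# Build a tree from the sample HTML in the question.
my $tree = HTML::TreeBuilder->new_from_content(<<'END_OF_HTML');
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

# Send the dump to an in-memory filehandle instead of STDOUT.
open my $fh, '>', \my $dump_text
    or die "Cannot open in-memory filehandle: $!";
$tree->dump($fh);
close $fh;

# The outline shows each element on its own line, with text nodes
# printed as quoted strings underneath their parent element.
print $dump_text;

$tree->delete;    # free the tree when done
```

In the printed outline, the date shows up as a quoted string directly under the <div> rather than under the <strong>, which is why the loop above looks at the <div>'s content list.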

edited Jul 21, 2011 at 16:04

answered Jul 21, 2011 at 13:30

Dave Cross


2 Comments

Ebikeneser Over a year ago

Looks good, but is there a way around having to hard-code the HTML? If I were reading from an HTML file, would I just open 'foo.csv'?

Dave Cross Over a year ago

Sorry, that was just there for demonstration purposes. I assumed that you knew how to parse data with HTML::TreeBuilder. The HTML::TreeBuilder object has a parse_file method (as you'll see in the documentation).
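As a rough sketch of the parse_file approach, the example below writes the sample HTML to a temporary file (a stand-in for a real file on disk, created here only so the example is self-contained) and then parses that file by name. The variable names are illustrative and not from the original answer.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Create a temporary HTML file standing in for a real input file.
my ($out, $filename) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$out} <<'END_OF_HTML';
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
close $out or die "Cannot close $filename: $!";

# parse_file reads and parses the whole file; it returns undef
# if the file cannot be opened.
my $tree = HTML::TreeBuilder->new;
$tree->parse_file($filename)
    or die "Cannot parse $filename: $!";

# Collect the plain-text children of every <div>.
my @text;
for my $div ( $tree->look_down( _tag => 'div' ) ) {
    push @text, grep { !ref } $div->content_list;
}

# Trim surrounding whitespace and drop anything left empty.
s/^\s+|\s+$//g for @text;
@text = grep { length } @text;

print "$_\n" for @text;

$tree->delete;    # free the tree when done
```

With a real file, only the $filename value would change; the temporary-file setup is not needed.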

RickF · Accepted Answer · 2011-07-21 12:57:24Z

2

It looks like HTML::Element::content_list() is the function you want. Child elements are returned as objects while text is returned as plain strings, so you can filter with ref() to get just the text part(s).

for ($tree->find('div')) {
  my @content = grep { ! ref } $_->content_list;
  # @content now contains just the bare text portion of the tag
}
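A page may also contain other <div> elements, including ones with no <strong> child. One possible way to handle that, sketched here under the assumption that the wanted <div> is the one whose <strong> label starts with "Date:", is to pass a code reference as an extra criterion to look_down and then apply the same grep on content_list. The sample HTML is adapted from this thread (with the closing </p> tidied up), and @date_divs and @dates are illustrative names.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(<<'END_OF_HTML');
<div class="leftpremium"><h1>Premium !</h1></div>
<p>They are <strong>frozen</strong>.</p>
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML

# Keep only <div> elements containing a <strong> whose text starts
# with "Date:". The code reference receives the candidate element
# as its first argument and must return true to accept it.
my @date_divs = $tree->look_down(
    _tag => 'div',
    sub {
        my $strong = $_[0]->look_down( _tag => 'strong' );
        return $strong && $strong->as_trimmed_text =~ /^Date:/;
    },
);

# For each matching <div>, join its bare text children and trim.
my @dates;
for my $div (@date_divs) {
    my $text = join ' ', grep { !ref } $div->content_list;
    $text =~ s/^\s+|\s+$//g;
    push @dates, $text if length $text;
}

print "$_\n" for @dates;

$tree->delete;    # free the tree when done
```

The first <div> is skipped because it has no <strong>, and the <strong> inside the <p> is never considered because only <div> elements are searched.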

answered Jul 21, 2011 at 12:57

RickF


Alan Haggai Alavi · Accepted Answer · 2011-07-21 12:52:44Z

1

You could work around it by removing the text within <strong> from <div>:

my $div      = $tree->look_down( '_tag' => 'div' );
my $div_text = $div->as_trimmed_text;
my $date     = $div_text;    # declared here so it is usable after the if block
if ( my $strong = $div->look_down( '_tag' => 'strong' ) ) {
    my $strong_text = $strong->as_trimmed_text;
    # \Q...\E quotes any regex metacharacters in the label text
    $date =~ s/\Q$strong_text\E\s*//;
}

edited Jul 21, 2011 at 12:52

answered Jul 21, 2011 at 10:18

Alan Haggai Alavi


6 Comments

Ebikeneser Over a year ago

It says that it can't call a method on an undefined value on the 'my $strong_text = $div->look_down( '_tag' => 'strong' )->as_trimmed_text;' line. Bearing in mind that this is inside a 'for' loop, 'for ( $tree->look_down( '_tag' => 'div')) { ', perhaps that is causing the error?

Alan Haggai Alavi Over a year ago

It should be fine to use look_down in a for loop. Can you please provide a sample of the HTML (with multiple div and strong elements) that you are trying to parse?

Ebikeneser Over a year ago

<div class="leftpremium"><h1>Premium !</h1></div> <p>They are <strong>frozen</strong>.<p>

Alan Haggai Alavi Over a year ago

I have updated my code with a check to see if a <strong> exists within a <div> or not.

Ebikeneser Over a year ago

There seems to be a syntax error stating a missing '}', but it still doesn't seem to pick up what I want.
