1

I have the following HTML:

<div>
   <strong>Date: </strong>
       19 July 2011
</div>

I have been using HTML::TreeBuilder to extract parts of HTML that are identified by tags or classes. However, the HTML above is giving me difficulty: I want to extract only the date, which is bare text with no tag or class of its own.

For instance, I tried:

for ( $tree->look_down( '_tag' => 'div' ) ) {
    my $date = $_->look_down( '_tag' => 'strong' )->as_trimmed_text;
}

But that seems to conflict with an earlier use of <strong>, and in any case it returns the label rather than the date. I want to extract just the '19 July 2011'. I have read the TreeBuilder documentation but cannot find a way of doing this.

How can I do this using TreeBuilder?

3 Answers

3

The "dump" method is invaluable in finding your way around an HTML::TreeBuilder object.
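As a rough illustration of what that looks like, the sketch below parses the markup from the question and dumps the resulting tree. Calling dump with no arguments prints to STDOUT; here it is given an in-memory filehandle (the variable names are my own) so the output can also be inspected as a string. Bare text nodes show up as quoted strings, while elements show up as tags.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<'END_OF_HTML');
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
$tree->eof;    # tell the parser the input is complete

# dump() accepts an optional filehandle; with no argument it writes to STDOUT.
my $dump = '';
open my $out, '>', \$dump or die "Cannot open in-memory handle: $!";
$tree->dump($out);
close $out;

print $dump;
```

In the dump, the date appears as a quoted string directly under the div, as a sibling of the strong element rather than inside it.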

The solution here is to get the parent element of the element you're interested in (which is, in this case, the <div>) and iterate across its content list. The text you're interested in will be plain text nodes, i.e. elements in the list that are not references to HTML::Element objects.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<END_OF_HTML);
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
$tree->eof;    # tell the parser the input is complete

my $date;

for my $div ($tree->look_down( _tag => 'div')) {
  for ($div->content_list) {
    $date = $_ unless ref;
  }
}

print "$date\n";

2 Comments

Looks good, but is there a way around having to hard-code the HTML? That is, if I were reading from an HTML file, would I just open 'foo.csv'?
Sorry, that was just there for demonstration purposes. I assumed that you knew how to parse data with HTML::TreeBuilder. The HTML::TreeBuilder object has a parse_file method (as you'll see in the documentation).
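A minimal sketch of the parse_file route, in case it helps. The temporary file is only there to make the example self-contained and is my own assumption; in practice you would pass the path of your real HTML file. The whitespace trimming is also an addition of mine.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Stand-in for a real HTML file on disk (hypothetical, for demonstration only).
my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} "<div>\n   <strong>Date: </strong>\n       19 July 2011\n</div>\n";
close $fh or die "Cannot close $filename: $!";

my $tree = HTML::TreeBuilder->new;

# parse_file returns undef if the file cannot be opened.
defined( $tree->parse_file($filename) )
    or die "Cannot parse $filename: $!";

my $date;

for my $div ( $tree->look_down( _tag => 'div' ) ) {
    for my $node ( $div->content_list ) {
        next if ref $node;          # skip child elements such as <strong>
        $date = $node;
        $date =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
    }
}

print "$date\n";
```

There is no need to call eof after parse_file; that is only required when feeding the parser chunks with parse.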
2

It looks like HTML::Element::content_list() is the function you want. Descendant nodes will be objects while text will just be text, so you can filter with ref() to just get the text part(s).

for ($tree->find('div')) {
  my @content = grep { ! ref } $_->content_list;
  # @content now contains just the bare text portion of the tag
}
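Here is a fuller sketch of the same idea, assuming a page with several divs where only some contain bare text (the first div is adapted from the sample posted in the comments on another answer). The @dates array and the trimming step are my own additions.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse(<<'END_OF_HTML');
<div class="leftpremium"><h1>Premium</h1></div>
<div>
   <strong>Date: </strong>
       19 July 2011
</div>
END_OF_HTML
$tree->eof;    # tell the parser the input is complete

my @dates;

for my $div ( $tree->find('div') ) {
    # Keep only the plain-text children; element children are references.
    my @text = grep { !ref } $div->content_list;

    s/^\s+|\s+$//g for @text;              # trim each text node
    push @dates, grep { length } @text;    # drop anything that was only whitespace
}

print "$_\n" for @dates;
```

Divs that contain only child elements simply contribute nothing, so there is no method call on an undefined value to guard against.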

1

You could work around it by removing the text within <strong> from <div>:

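my $div      = $tree->look_down( '_tag' => 'div' );
my $div_text = $div->as_trimmed_text;
my $date     = $div_text;    # declared here so it is still in scope after the if
if ( my $strong = $div->look_down( '_tag' => 'strong' ) ) {
    my $strong_text = $strong->as_trimmed_text;
    # \Q...\E stops regex metacharacters in the label from being interpreted
    $date =~ s/^\Q$strong_text\E\s*//;
}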

6 Comments

It says that it can't call a method on an undefined value on the 'my $strong_text = $div->look_down( '_tag' => 'strong' )->as_trimmed_text;' line. Bearing in mind that this is inside a 'for' loop ('for ( $tree->look_down( '_tag' => 'div')) {'), perhaps that is causing the error?
It should be fine to use look_down in a for loop. Can you please provide a sample of the HTML (with multiple div and strong elements) that you are trying to parse?
<div class="leftpremium"><h1>Premium &#33;</h1></div> <p>They are <strong>frozen</strong>.<p>
I have updated my code with a check to see if a <strong> exists within a <div> or not.
There seems to be a syntax error reporting a missing '}', and it still doesn't seem to pick up what I want.