3

If the text "XYZ 81.6 (-0.1)" needs to be extracted from an HTML web page, how can it be done with Perl? Many thanks.

<table border="0" width="100%">
          <caption valign="top">
            <p class="InfoContent"><b><br></b>
          </caption>
          <tr>
            <td colspan="3"><p class="InfoContent"><b>ABC</b></td>
          </tr>
          <tr>
            <td valign="top" height="61" width="31%">
              <p class="InfoContent"><b><font color="#0000FF">XYZ 81.6 (-0.1)&nbsp;<br>22/06/2011</font></b></p>
            </td>
          </tr></table>

3 Answers

4

I would use HTML::TreeBuilder::XPath for this (and yes, it is a shameless plug!):

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $t= HTML::TreeBuilder::XPath->new_from_file( shift @ARGV);

my $text= $t->findvalue( '//p[@class="InfoContent"]/b/font[@color="#0000FF"]');

$text=~ s{\).*}{)};

print "found '$text'\n";

It is quite fragile though: as far as I can tell the only way to narrow down the XPath expression to just what you want is to use the font tag. That is likely to change in the future, so if (when!) the code breaks, that's where you'll have to look first.
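To see what the substitution above does in isolation, here is a minimal sketch that needs no HTML module. The input string is an assumption about what findvalue might return for this snippet (the text on both sides of the br joined together, with the &nbsp; decoded to U+00A0), and trim_after_paren is a hypothetical helper name, not part of HTML::TreeBuilder::XPath.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: keep everything up to and including the first ")".
sub trim_after_paren {
    my ($text) = @_;
    # /s lets "." also match a newline, in case the value spans lines.
    $text =~ s{\).*}{)}s;
    return $text;
}

# Assumed shape of the extracted value: label, nbsp, then the date.
my $raw = "XYZ 81.6 (-0.1)\x{A0}22/06/2011";

print "found '", trim_after_paren($raw), "'\n";
```

If the page ever puts a second parenthesised value before the date, this would cut at the first one, so the pattern would need to be revisited along with the XPath expression.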

2 Comments

This is the only answer that actually offers a concrete solution :)
Yep, sorry about that, maybe I should have just linked to the usual stackoverflow.com/questions/1732348/…
0

You can use something like this:

bash-3.2$ perl -MLWP::Simple -le ' $current_value = get("http://stackoverflow.com/questions/6454398/how-to-extract-specific-information-from-html-webpage-using-perl"); if ($current_value=~/(XYZ\s\d+\.\d+\s\(.*?\))/s) { print "Matched pattern is:\t $1";} '
Matched pattern is:      XYZ 81.6 (-0.1)
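The same idea can be tried offline against a string instead of a fetched page. This is only a sketch: extract_fields is a hypothetical name, the pattern is slightly stricter than the one-liner's, and it assumes the label is literally "XYZ" followed by two decimal numbers, the second in parentheses.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: pull the label, the value and the change out of raw HTML.
# Returns a hash reference on success, undef if the pattern does not match.
sub extract_fields {
    my ($html) = @_;
    return unless $html =~ /(XYZ)\s+(\d+\.\d+)\s+\(([-+]?\d+\.\d+)\)/;
    return { name => $1, value => $2, change => $3 };
}

# The relevant line from the question's snippet.
my $html = '<b><font color="#0000FF">XYZ 81.6 (-0.1)&nbsp;<br>22/06/2011</font></b>';

my $fields = extract_fields($html);
print "$fields->{name} $fields->{value} ($fields->{change})\n" if $fields;
```

A regex like this ignores the document structure entirely, so it will match the first "XYZ ..." occurrence anywhere in the page, including inside comments or scripts.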

0

Mirod's answer is awesome. This being Perl, I'll throw another approach out there.

Let's assume you have the HTML file in input.html. Here's a Perl program which uses the HTML::TreeBuilder module to extract the text:

#!/usr/bin/perl

use 5.10.0 ;
use strict ;
use warnings ;

use HTML::TreeBuilder ;

my $tree = HTML::TreeBuilder -> new () ;

$tree -> parse_file ( 'input.html' ) ;

my $text = ($tree -> address ( '0.1.0.2.0.0.0.1' ) -> content_list ()) [0] ;

say $text ;

Running it:

/tmp/tmp $ ./_extract-a.pl 
XYZ 81.6 (-0.1)�

So how did I come up with that '0.1.0.2.0.0.0.1' magic number? Each node in the tree that results from parsing the HTML file has an "address". The text that you are interested in sits inside the node whose address is '0.1.0.2.0.0.0.1'.
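As a rough illustration of how such an address can be read, the sketch below walks a toy tree built from plain hashes and arrays. It is not how HTML::Element is implemented; node_at and the tree layout are assumptions made up for this example. The idea it shows is that the leading 0 names the root and every following number is a child index, with text nodes counting as children too.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Toy tree: element nodes are hashes, text nodes are plain strings.
my $tree = {
    tag      => 'html',
    children => [
        { tag => 'head', children => [] },
        {
            tag      => 'body',
            children => [
                {
                    tag      => 'b',
                    children => [
                        ' ',
                        { tag => 'font', children => ['XYZ 81.6 (-0.1)'] },
                    ],
                },
            ],
        },
    ],
};

# Hypothetical helper: follow a dotted address from the root.
sub node_at {
    my ($root, $address) = @_;
    my @steps = split /\./, $address;
    shift @steps;                       # the leading 0 is the root itself
    my $node = $root;
    for my $index (@steps) {
        return unless ref $node;        # text nodes have no children
        $node = $node->{children}[$index];
        return unless defined $node;
    }
    return $node;
}

my $font = node_at($tree, '0.1.0.1');   # body -> b -> second child of b
print $font->{children}[0], "\n";
```

Because every index depends on the exact shape of the page, a single extra element or whitespace node earlier in the document shifts the address, which makes this approach at least as fragile as a tag-based XPath.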

So, how do you display the node addresses? Here's a little program I call treebuilder-dump; when you pass it an HTML file, it displays it with the nodes labeled:

#!/usr/bin/perl

use 5.10.0 ;
use strict ;
use warnings ;

use HTML::TreeBuilder ;

my $tree = HTML::TreeBuilder->new ;

if ( @ARGV != 1 ) { die "Expected exactly one HTML file argument" ; }

if ( ! -f $ARGV[0] ) { die "File does not exist: $ARGV[0]" ; }

$tree->parse_file ( $ARGV[0] ) ;

$tree->dump () ;

$tree->delete () ;

So for example, here's the output when run on your HTML snippet:

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <table border="0" width="100%"> @0.1.0
      <caption valign="top"> @0.1.0.0
        <p class="InfoContent"> @0.1.0.0.0
          <b> @0.1.0.0.0.0
            <br /> @0.1.0.0.0.0.0
      <tr> @0.1.0.1
        <td colspan="3"> @0.1.0.1.0
          <p class="InfoContent"> @0.1.0.1.0.0
            <b> @0.1.0.1.0.0.0
              "ABC"
      <tr> @0.1.0.2
        <td height="61" valign="top" width="31%"> @0.1.0.2.0
          <p class="InfoContent"> @0.1.0.2.0.0
            <b> @0.1.0.2.0.0.0
              " "
              <font color="#0000FF"> @0.1.0.2.0.0.0.1
                "XYZ 81.6 (-0.1)�"
                <br /> @0.1.0.2.0.0.0.1.1
                "22/06/2011"
              " "

You can see that the text you're interested in is the first child of the font node, which has address 0.1.0.2.0.0.0.1.
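The trailing � in the output is most likely the &nbsp; from the source HTML, decoded to a non-breaking space (U+00A0) and then printed to a terminal expecting a different encoding. Assuming that is the case, a small clean-up step removes it; clean_text is a hypothetical helper name.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: drop trailing whitespace, including non-breaking spaces.
sub clean_text {
    my ($text) = @_;
    # \x{A0} is listed explicitly because \s does not always match it.
    $text =~ s/[\s\x{A0}]+\z//;
    return $text;
}

# Assumed value of the first text child of the font node.
my $text = "XYZ 81.6 (-0.1)\x{A0}";

print clean_text($text), "\n";
```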
