How can I parse only part of an HTML file and ignore the rest?

Question

In each of 5,000 HTML files I have to get only one line of text, which is line 999. How can I tell the HTML::Parser that I only have to get line 999?

</p><h1>dataset 1:</h1>

&nbsp;<table border="0" bgcolor="#EFEFEF"  leftmargin="15" topmargin="5"><tr>  
<td><strong>name:</strong>&nbsp;</td>  <td width=500> myname one         </td></tr><tr>  
<td><strong>type:</strong>&nbsp;</td>  <td width=500>       type_one  (04313488)        </td></tr><tr>
<td><strong>aresss:</strong>&nbsp;</td><td>Friedrichstr. 70,&nbsp;73430&nbsp;Madrid</td></tr><tr>  
<td><strong>adresse_two:</strong>&nbsp;</td>  <td>          no_value        </td></tr><tr>  
<td><strong>telefone:</strong>&nbsp;</td>  <td>         0000736111/680040        </td></tr><tr>  
<td><strong>Fax:</strong>&nbsp;</td>  <td>          0000736111/680040        </td></tr><tr>  
<td><strong>E-Mail:</strong>&nbsp;</td>  <td>       Keine Angabe        </td></tr><tr>      
<td><strong>Internet:</strong>&nbsp;</td><td><a href="http://www.mysite.es" target="_blank">www.mysite.es</a><br></td></tr><tr> <td><strong>the office:</strong>&nbsp;</td>   
<td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr> 
<td><strong>:</strong>&nbsp;</td><td> no_value </td></tr><tr> 
<td><strong>officer:</strong>&nbsp;</td>  <td> no_value        </td>  </td></tr><tr>
<td><strong>employees:</strong>&nbsp;</td>  <td> 259        </td></tr><tr>  
<td><strong>offices:</strong>&nbsp;</td>  <td>     8        </td></tr><tr>  
<td><strong>worker:</strong>&nbsp;</td>  <td>     no_value        </td></tr><tr>  
<td><strong>country:</strong>&nbsp;</td>  <td>    contryname        </td></tr><tr>  
<td><strong>the_council:</strong>&nbsp;</td>  <td>

So the question is: can I run the search over all 5,000 files, given that only line 999 is of interest? In other words, can I tell the HTML parser to look at, and extract, exactly line 999?

Hello RedGrittyBrick, I have little experience with HTML::TokeParser. Here is what I have so far:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

print $name->as_text;

By the way, RedGrittyBrick: see one of the example pages: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488. The grey-shaded block contains the wanted information, 17 lines in all. Note that I have 5,000 different HTML files, all structured in exactly the same way.

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

I would love to get some hints.

Possible duplicate of "xpather running against HTML-files: defining the paths to prepare a parser-job run Perl's HTML::TokePaser"
– Sinan Ünür, Commented Oct 16, 2010 at 14:10
Which line of that HTML is the one you want to extract, or is that all on one line?
– brian d foy, Commented Oct 16, 2010 at 19:20

user229044 · Accepted Answer · 2010-10-16 17:32:29Z

Do you mean the 999th line or the 999th table row?

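The former might be

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat

(Closing ARGV at the end of each file resets $. so that the line count restarts for every file; without it, the numbering would continue across all the files.)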
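If a script is easier to extend than a one-liner, the line-based approach could be sketched roughly as follows. It uses only core Perl. The helper name `nth_line` and the way files are passed on the command line are assumptions made for illustration; they are not part of the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return line number $n (1-based) of the file at
# $path, or undef if the file has fewer than $n lines. Reading stops as
# soon as the requested line has been found.
sub nth_line {
    my ($path, $n) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        if ($. == $n) {          # $. is the current line number of $fh
            close $fh;
            chomp $line;
            return $line;
        }
    }
    close $fh;
    return;
}

# Assumed usage: perl nth_line.pl /path/to/*.html
for my $file (@ARGV) {
    my $line = nth_line($file, 999);
    print "$file: ", (defined $line ? $line : '(fewer than 999 lines)'), "\n";
}
```

Because every file gets its own filehandle, the line counter starts afresh for each one. This only helps if the wanted text really sits on the same physical line in every file.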

The latter would involve an HTML parser and some selection logic. A SAX parser might be better for fast processing of a large number of files, but whether that is an option probably depends on which version of HTML is used and whether it is well-formed.

Perl has many XML and HTML parsers - did you have any particular module in mind?

EDIT:

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
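my $html = get($url);
# get() returns undef on failure, so stop here rather than parse nothing
die "Could not fetch $url\n" unless defined $html;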

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]');

Try copying the above into a file and running it with Perl.
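The script's comments suggest replacing the download with a loop over the existing files. A minimal sketch of that loop, assuming the files are given on the command line and that every row of the `bgcolor` table holds a label cell followed by a value cell, might look like this. The helper name `extract_fields` is hypothetical, and the row layout is inferred from the sample HTML in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: parse one page and return a hash reference mapping
# each label (first cell) to its value (second cell) for the rows of the
# table that carries a bgcolor attribute.
sub extract_fields {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my %field;
    for my $row ($tree->findnodes('//table[@bgcolor]/tr')) {
        my @cells = $row->findnodes('./td');
        next unless @cells >= 2;          # skip empty or malformed rows

        my ($label, $value) = map { $_->as_text } @cells[0, 1];
        for ($label, $value) {
            s/^[\s\xA0]+//;               # strip leading whitespace and &nbsp;
            s/[\s\xA0]+$//;               # strip trailing whitespace and &nbsp;
        }
        $label =~ s/:$//;                 # "name:" becomes "name"
        $field{$label} = $value;
    }

    $tree->delete;                        # free the parse tree before the next file
    return \%field;
}

# Assumed usage: perl extract.pl /path/to/*.html
for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    my $field = extract_fields($html);
    print "$file\n";
    print "  $_ = $field->{$_}\n" for sort keys %$field;
}
```

Returning a hash per file keeps the parsing separate from whatever happens next, so the `print` loop could later be swapped for DBI inserts without touching the extraction code.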

answered Oct 16, 2010 at 0:02 by RedGrittyBrick; edited Oct 16, 2010 at 17:32 by user229044

5 Comments

zero Over a year ago

use HTML::TreeBuilder::XPath; my $tree = HTML::TreeBuilder::XPath->new; #use real file name here open(my $fh, "<", "file.html") or die $!; $tree->parse_file($fh); my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]}); print $name->as_text; See the example page: kultusportal-bw.de/servlet/PB/menu/1188427/… In the grey-shaded block you see the wanted information, 17 lines in all. Note that I have 5,000 HTML files.

zero Over a year ago

Hello RedGrittyBrick: I think I now understand your code. You did the trick with the color: you solved the problem by selecting the grey-shaded table. Is that right? Great job! I am overwhelmed. Congratulations. Greetings, Martin

brian d foy Over a year ago

If you're going to show code, please update your question. Forcing people to read code in comments is cruel.

zero Over a year ago

Hello brian d foy, thanks for the comment. I agree. Being a novice, I have a lot to learn. Greetings

RedGrittyBrick Over a year ago

@Martin, yes - the HTML has several tables, therefore specify a table attribute that uniquely identifies which table you are interested in. I found it worth reading the W3C tutorials on XPath expressions.
