
I have 5,000 HTML files, and from each one I need only a single line of text: line 999. How can I tell HTML::Parser to extract only line 999?

</p><h1>dataset 1:</h1>

&nbsp;<table border="0" bgcolor="#EFEFEF"  leftmargin="15" topmargin="5"><tr>  
<td><strong>name:</strong>&nbsp;</td>  <td width=500> myname one         </td></tr><tr>  
<td><strong>type:</strong>&nbsp;</td>  <td width=500>       type_one  (04313488)        </td></tr><tr>
<td><strong>aresss:</strong>&nbsp;</td><td>Friedrichstr. 70,&nbsp;73430&nbsp;Madrid</td></tr><tr>  
<td><strong>adresse_two:</strong>&nbsp;</td>  <td>          no_value        </td></tr><tr>  
<td><strong>telefone:</strong>&nbsp;</td>  <td>         0000736111/680040        </td></tr><tr>  
<td><strong>Fax:</strong>&nbsp;</td>  <td>          0000736111/680040        </td></tr><tr>  
<td><strong>E-Mail:</strong>&nbsp;</td>  <td>       Keine Angabe        </td></tr><tr>      
<td><strong>Internet:</strong>&nbsp;</td><td><a href="http://www.mysite.es" target="_blank">www.mysite.es</a><br></td></tr><tr> <td><strong>the office:</strong>&nbsp;</td>   
<td><a href="http://www.mysite_two" target="_blank">mysite_two </a><br></td></tr><tr> 
<td><strong>:</strong>&nbsp;</td><td> no_value </td></tr><tr> 
<td><strong>officer:</strong>&nbsp;</td>  <td> no_value        </td>  </td></tr><tr>
<td><strong>employees:</strong>&nbsp;</td>  <td> 259        </td></tr><tr>  
<td><strong>offices:</strong>&nbsp;</td>  <td>     8        </td></tr><tr>  
<td><strong>worker:</strong>&nbsp;</td>  <td>     no_value        </td></tr><tr>  
<td><strong>country:</strong>&nbsp;</td>  <td>    contryname        </td></tr><tr>  
<td><strong>the_council:</strong>&nbsp;</td>  <td> 

So the question is: can I search all 5,000 files when the only part of interest is line 999? In other words, can I tell the HTML parser to look at (and extract) exactly line 999?


Hello RedGrittyBrick, I have little experience with HTML::TokeParser.

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});

print $name->as_text;

By the way, RedGrittyBrick, here is one of the example pages: http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488. The grey shaded block contains the information I want: 17 lines in total. Note that I have 5,000 different HTML files, all structured in exactly the same way.

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

I would love to get some hints.

1 Answer

Do you mean the 999th line or the 999th table row?

The former might be

perl -ne 'print if $. == 999; close ARGV if eof' /path/to/*.dat
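When many files are processed in one run, the line counter has to restart for each file, otherwise only the 999th line of the combined input is seen. Below is a minimal sketch of the same lookup as a script, using only core Perl. The helper name `nth_line` is an assumption made for illustration and is not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return line number $n (1-based) of the file at
# $path, or nothing if the file has fewer than $n lines.
sub nth_line {
    my ($path, $n) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        # $. holds the line number of the handle read most recently,
        # and each newly opened handle starts counting from zero.
        if ($. == $n) {
            close $fh;
            return $line;
        }
    }
    close $fh;
    return;
}

# Print line 999 of every file named on the command line.
for my $path (@ARGV) {
    my $line = nth_line($path, 999);
    print "$path: $line" if defined $line;
}
```

Saved under a name of your choice, it would be run with the file names as arguments. Stopping as soon as the wanted line is found avoids reading the rest of each file.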

The latter would involve an HTML parser and some selection logic. A SAX-style (event-driven) parser might be better for fast processing of a large number of files. The right choice probably depends on which version of HTML is used and whether it is well-formed.

Perl has many XML and HTML parsers - did you have any particular module in mind?


EDIT:

Your problem seems to be your XPath expression. The actual HTML is much more complex than your XPath suggests. The following expression works better:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder::XPath;

#
# replace this with a loop over 5000 existing files
#
my $url = 'http://www.kultusportal-bw.de/'.
          'servlet/PB/menu/1188427/index.html'.
          '?COMPLETEHREF='.
          'http://www.kultus-bw.de/'.
          'did_abfrage/detail.php?id=04313488';
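# get() returns undef on failure, so stop early with a clear message
my $html = get($url) or die "Could not fetch $url\n";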

my $tree = HTML::TreeBuilder::XPath->new();
#
# within the loop process the html like this
#
$tree->parse($html);
$tree->eof;
print $tree->findvalue('//table[@bgcolor]/tr[1]'), "\n";
# free the parse tree before handling the next file
$tree->delete;

Try cutting the above and pasting into a file then running it with Perl.
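The loop over local files mentioned in the script's comments could look roughly like the sketch below. It assumes the files are passed on the command line and that each one contains a table with a `bgcolor` attribute whose rows hold a label cell followed by a value cell, as in the sample in the question. The helper names `clean` and `extract_fields` are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: strip leading and trailing whitespace,
# including the non-breaking spaces that &nbsp; decodes to.
sub clean {
    my ($text) = @_;
    $text =~ s/^[\s\xA0]+//;
    $text =~ s/[\s\xA0]+$//;
    return $text;
}

# Hypothetical helper: return a list of [label, value] pairs, one for
# each row of any table that carries a bgcolor attribute.
sub extract_fields {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @fields;
    for my $row ($tree->findnodes('//table[@bgcolor]//tr')) {
        my @cells = $row->findnodes('./td');
        next unless @cells >= 2;          # skip rows without a value cell
        my $label = clean($cells[0]->as_text);
        my $value = clean($cells[1]->as_text);
        $label =~ s/:$//;                 # "name:" becomes "name"
        push @fields, [$label, $value];
    }
    $tree->delete;                        # free the tree before the next file
    return @fields;
}

# Print one tab-separated line per field: file, label, value.
for my $file (@ARGV) {
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print join("\t", $file, @$_), "\n" for extract_fields($html);
}
```

Returning plain label and value pairs keeps the parsing separate from whatever is done with the data afterwards, such as writing it to a database.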


5 Comments

use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
#use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
print $name->as_text;
One of the example pages: kultusportal-bw.de/servlet/PB/menu/1188427/… In the grey shaded block you see the wanted information: 17 lines. Note that I have 5,000 HTML files.
Hello RedGrittyBrick: I think I now understand your code. You did the trick with the color: you select the table by its grey background. Is that right? Great job! I am overwhelmed. Congrats. Greetings, Martin
If you're going to show code, please update your question. Forcing people to read code in comments is cruel.
Hello brian d foy, thanks for the comment. I agree. Being a novice, I have a lot to learn. Greetings
@Martin, yes - the HTML has several tables, therefore specify a table attribute that uniquely identifies which table you are interested in. I found it worth reading the W3C tutorials on XPath expressions.
