How can I extract data from HTML tables in Perl?

Question

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:

<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>

Here I want to extract "Time Played", "Artist", "Title", and "Label", and print them to an output file.

I've tried many regular expressions such as:

$lines =~ / (<td>) /
       OR
$lines =~ / <td>(.*)< /
       OR
$lines =~ / >(.*)< /

My current program looks like this:

#!perl -w

open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;

open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;

my $lines = join '', <INPUT_FILE>;

print "Hello 2\n";

if ($lines =~ / (\S.*\S) /) {
    print "this is 1: \n";
    print $1;
    if ($lines =~ / <td>(.*)< / ) {
        print "this is the 2nd 1: \n";
        print $1;
        print "the word was: $1.\n";
        $Time = $1;
        print $Time;
        print OUTPUT_FILE $Time;
    } else {
        print "2ND IF FAILED\n";
    }
} else {
    print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);

@Kinopiko: Close enough. What's the difference between wanting to extract portions between td tags and li tags?
– Ken White, Commented Oct 30, 2009 at 19:39
By the way, you seem to be confused about your task: the text you are trying to parse is within tags. The strings you want are marked up, so to speak.
– Sinan Ünür, Commented Oct 30, 2009 at 19:47

Ether · Accepted Answer · 2017-05-23 12:32:25Z

18

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
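As one illustration of the module route, here is a minimal sketch using HTML::TreeBuilder, a general-purpose HTML parser from CPAN, on the row from the question. It is only one of several modules that would do the job. The enclosing <table> tags and the variable names are assumptions added for the example; the question shows only the <tr> itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# The row from the question, wrapped in <table> tags (an assumption:
# the question shows only the <tr>, not its surrounding markup).
my $html = '<table><tr class="Highlight"><td>Time Played</a></td><td></td>'
         . '<td>Artist</td><td width="1%"></td><td>Title</td>'
         . '<td>Label</td></tr></table>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect the text of every <td>, skipping cells with no visible text.
my @cells = grep { /\S/ }
            map  { $_->as_text }
            $tree->look_down( _tag => 'td' );

$tree->delete;    # release the parse tree explicitly

print join( ", ", @cells ), "\n";
```

Because the parser builds a tree, the stray </a> in the first cell and the width attribute on the fourth should not need any special handling in the extraction code.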

answered Oct 30, 2009 at 17:42 by Ether; edited May 23, 2017 at 12:32 by Community

3 Comments

user181548 Over a year ago

In this case the requested parsing is rather simple though.

Sinan Ünür Over a year ago

@Ether It seems to me some people enjoy torturing themselves. I don't know why.

Ether Over a year ago

@Sinan: My theory is that there is a special kind of learning curve with regexes: at first they seem so mind-blowing that there's nothing they can't (or shouldn't) do. Anything that looks like a parsing problem therefore must be solvable with regexes.

Sinan Ünür · Accepted Answer · 2009-10-30 19:43:13Z

11

Use HTML::TableExtract. Really.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

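# Download the page once and cache it locally, so repeated runs
# do not hit the server again.
my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}

# Extract only the columns whose header cells match these strings.
my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);

$te->parse_file($file);

# Take the first table that satisfied the constraints above.
my ($table) = $te->tables;
die "No matching table found in $file\n" unless $table;

print join("\t", @headers), "\n";

for my $row ( $table->rows ) {
    print join("\t", @$row), "\n";
}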

This is what I meant in another post by "task-specific" HTML parsers.

You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.
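Applied to the row from the question, the same module might be used roughly as follows. This is a sketch rather than a drop-in solution: it assumes the row sits inside a <table> element (the question shows only the <tr>), it passes no constraints so that every table in the input is matched, and the blank-cell filter and the @found array are additions for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TableExtract;

# The row from the question, wrapped in <table> tags (an assumption).
my $html = '<table><tr class="Highlight"><td>Time Played</a></td><td></td>'
         . '<td>Artist</td><td width="1%"></td><td>Title</td>'
         . '<td>Label</td></tr></table>';

# No headers, depth, count or attribs: every table is extracted.
my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;    # tell the underlying HTML::Parser that the input is complete

my @found;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Keep only cells that contain visible text.
        push @found, grep { defined($_) && /\S/ } @$row;
    }
}

print join( "\t", @found ), "\n";
```

To write the result to a file instead, open a lexical filehandle and print to it in place of the final print.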

answered Oct 30, 2009 at 19:43 by Sinan Ünür

3 Comments

Ashley Over a year ago

I know I’m very late to this party but the getstore() is a very nice touch to avoid hammering on someone’s server. Great sample code.

Sue Mynott Over a year ago

I voted this up because you provided working code, even though I was tempted not to because you couldn't resist lecturing the OP at the end. Knowing which documentation to read is not all that easy.

Sinan Ünür Over a year ago

@SueSpence Thank you for the upvote, but people who just keep throwing one pattern after another at HTML documents whose format is not under their control do need to be reminded there are better solutions. May I recommend that you add working code to your answer on the topic, instead of lecturing me about not lecturing others?

user181548 · Accepted Answer · 2009-10-30 18:12:23Z

0

That's an easy one:

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";

See http://codepad.org/qz9d5Bro if you want to try running it.
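For reference, a short sketch of why the patterns in the question fail while this one matches. Without the /x modifier, the spaces inside / <td>(.*)< / are literal characters that must occur in the input, and .* is greedy, so even with the spaces removed it runs to the last < in the string. The variable names below are illustrative only, and the working pattern depends on the markup staying exactly this simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td>'
         . '<td>Artist</td><td width="1%"></td><td>Title</td>'
         . '<td>Label</td></tr>';

# 1. Spaces inside a pattern are literal unless /x is used, so this
#    needs " <td>" and "< " to occur in the input. Neither does.
my $spaced_matches = ( $html =~ / <td>(.*)< / ) ? 1 : 0;

# 2. With the spaces removed the pattern matches, but the greedy .*
#    captures everything up to the last "<" in the string.
my ($greedy) = $html =~ /<td>(.*)</;

# 3. [^<]+ cannot cross a tag boundary, and /g in list context
#    returns every capture rather than only the first.
my @cells = $html =~ />([^<]+)</g;

print "spaced pattern matched: $spaced_matches\n";
print "greedy capture: $greedy\n";
print "cells: ", join( ", ", @cells ), "\n";
```

The third pattern only works here because no cell contains nested markup or a literal < character; any change to the page layout would require a new pattern.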

answered Oct 30, 2009 at 18:06 by user181548; edited Oct 30, 2009 at 18:12

6 Comments

user181548 Over a year ago

Wait until you see the DOWNVOTES I get for telling you this.

Sinan Ünür Over a year ago

@nick because this is the kind of approach that will keep one wasting a lot more time and effort again and again always looking for just the right regex each time one needs to parse HTML.

user181548 Over a year ago

Parsing JSON with regular expressions is just as hard as parsing HTML, and yet one of the people on a previous discussion, stackoverflow.com/questions/1598053/…, who was most dogmatic about not using regexes for parsing HTML then went on to approve of a solution to a problem which involved using regexes to parse JSON: stackoverflow.com/questions/1636352/….

Sinan Ünür Over a year ago

Well, I cannot speak for others. I do think using regular expressions was a waste of time in that case as well. So, I added a Perl one liner using JSON.pm to that thread.

daotoad Over a year ago

@Kinopiko, it appears that too few people on SO understand the Chomsky Hierarchy. Parsing JSON with regexes is foolish, even more so than HTML, since a real parser is available that is so much simpler to use than any half-assed regex solution could ever hope to be. This demonstrates the value of CS in educating programmers.
