I am trying to extract data from an HTML table with Perl, using HTML::TableExtract. Specifically, I am trying to grab some rushing stats for the 2024 Baltimore Ravens from Pro Football Reference. The web page is here:
https://www.pro-football-reference.com/teams/rav/2024.htm
HTML::TableExtract finds four tables on that page. It finds:
- Table 0,0: labelled "Team Stats and Rankings" on the web page
- Table 0,1: labelled "Schedule & Game Results" on the web page
- Table 0,2: labelled "Team Conversions" on the web page
- Table 0,3: labelled "Passing" on the web page
And – that's it! There are at least 6 or 8 other tables on the page, including the one labelled "Rushing & Receiving", which is the one I want. I see those tables in my browser window, and I see them in the page source when I view it. But HTML::TableExtract doesn't seem to notice them.
My code is below. I'm using the HTML::TableExtract->new() constructor with no attributes specified, to grab ALL tables on the web page.
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;

my $team = 'rav';
my $html_string = 'https://www.pro-football-reference.com/teams/' . $team . '/2024.htm';
print " Processing $html_string\n";
print "\n";
# get() returns undef on failure, so stop early rather than parse nothing
my $download = get($html_string)
    or die "Could not download $html_string\n";
my $rowcount = 0;
# No constructor attributes: extract every table the parser sees
my $te = HTML::TableExtract->new();
$te->parse($download);
foreach my $ts ($te->tables) {
    print "Table ", join(',', $ts->coords), ":\n";
    $rowcount = 0;
    foreach my $row ($ts->rows) {
        if ($ts->count > 2) {    # This part is for output clarity, to
            $rowcount++;         # restrict printing to only header rows:
        }                        # 2 rows for the first 3 tables, then one
        if ($rowcount < 2) {     # row for any subsequent tables
            # Empty cells come back undef; print them as empty strings
            print " ", join(',', map { $_ // '' } @$row), "\n";
        }
        $rowcount++;
    }
}
print "\n";
This is the output I get. The script finds four tables; I see 13 in the page source.
Processing https://www.pro-football-reference.com/teams/rav/2024.htm
Table 0,0:
,,,Tot Yds & TO,,,,,Passing,,,,,,,Rushing,,,,,Penalties,,,,,,Average Drive,,,,
Player,PF,Yds,Ply,Y/P,TO,FL,1stD,Cmp,Att,Yds,TD,Int,NY/A,1stD,Att,Yds,TD,Y/A,1stD,Pen,Yds,1stPy,#Dr,Sc%,TO%,Start,Time,Plays,Yds,Pts
Table 0,1:
,,,,,,,,,,Score,,Offense,,,,,Defense,,,,,Expected Points,,
Week,Day,Date,,,,OT,Rec,,Opp,Tm,Opp,1stD,TotYd,PassY,RushY,TO,1stD,TotYd,PassY,RushY,TO,Offense,Defense,Sp. Tms
Table 0,2:
,Downs,,,,,,Red Zone,,
Player,3DAtt,3DConv,3D%,4DAtt,4DConv,4D%,RZAtt,RZTD,RZPct
Table 0,3:
Rk,Player,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,1D,Succ%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,Sk%,NY/A,ANY/A,4QC,GWD,Awards
As you can see, four tables are found. The last one is the "Passing" table on the Pro Football Reference site. That does not match what I see in the page source when I view it in my browser.
I ran a version of this script that prints the downloaded HTML to a file. That file has 13 tables in it, including this:
<table class="per_match_toggle sortable stats_table" id="rushing_and_receiving" data-cols-to-freeze=",2"> <caption>Rushing & Receiving Table</caption>
That's the table I want! It is followed by a "TR" with headers starting at line 2033 of the saved download, and then a "tbody" starting at line 2076, with "TD"s in it containing the data I want to pull.
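One check that could be run against the saved download, sketched below under an assumption that nothing above confirms: if the missing tables are wrapped in HTML comments (`<!-- ... -->`), they would show up in the page source while an HTML parser skips them. The sample string, the `count_tables` helper, and the variable names are hypothetical stand-ins for illustration; only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the saved download: one table in plain
# markup and one wrapped in an HTML comment (the assumed cause).
my $html = <<'HTML';
<table id="passing"><tr><th>Player</th></tr></table>
<!--
<table id="rushing_and_receiving"><tr><th>Player</th></tr></table>
-->
HTML

# Count opening <table> tags in a string.
sub count_tables {
    my ($text) = @_;
    my $n = () = $text =~ /<table\b/gi;
    return $n;
}

# Drop every HTML comment, leaving only the tables a parser would see.
(my $visible = $html) =~ s/<!--.*?-->//gs;

# Drop only the comment markers, exposing any commented-out tables.
(my $uncommented = $html) =~ s/<!--|-->//g;

my $total   = count_tables($html);
my $seen    = count_tables($visible);
my $exposed = count_tables($uncommented);

print "tables in raw text:      $total\n";
print "tables outside comments: $seen\n";
print "tables after unwrapping: $exposed\n";
```

If the first two counts differ for the real download, passing the unwrapped string to `$te->parse` instead of `$download` might expose the remaining tables. That is a guess to test, not a confirmed fix.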
How can I access that data in the script?