Perl's HTML::TableExtract does not see all the tables on Pro Football Reference pages

Question

I am trying to extract data from an HTML table with Perl, using HTML::TableExtract. Specifically, I am trying to grab some rushing stats for the 2024 Baltimore Ravens from Pro Football Reference. The web page is here:

https://www.pro-football-reference.com/teams/rav/2024.htm

HTML::TableExtract finds four tables on that page. It finds:

Table 0,0: labelled "Team Stats and Rankings" on the web page
Table 0,1: labelled "Schedule & Game Results" on the web page
Table 0,2: labelled "Team Conversions" on the web page
Table 0,3: labelled "Passing" on the web page

And – that's it! There are at least 6 or 8 other tables on the page, including the one labelled "Rushing & Receiving" which is what I want. I see those tables in my browser window, and I see them in the page source when I view them. But HTML::TableExtract doesn't seem to notice them.

My code is below. I'm using the HTML::TableExtract->new() constructor with no attributes specified, to grab ALL tables on the web page.

use strict;
#use warnings;
use HTML::TableExtract;
use LWP::Simple;

my $team = 'rav';
my $html_string = 'https://www.pro-football-reference.com/teams/' . $team . '/2024.htm';
print "   Processing  $html_string\n";
print "\n";

my $download = get $html_string ;
my $rowcount = 0;

my $te = HTML::TableExtract->new();
$te->parse($download);

foreach my $ts ($te->tables) {
   print "Table ", join(',', $ts->coords), ":\n";
   $rowcount = 0;
   foreach my $row ($ts->rows) {
       if ($ts->coords > 2) {       # This part is for output clarity, to
          $rowcount++;              # restrict printing to only header rows:
       }                            # 2 rows for the first 3 tables, then one
       if ($rowcount < 2){          # row for any subsequent tables
          print "   ", join(',', @$row), "\n";
       }
      $rowcount++;
   }
}
   
print "\n";

This is the output I get. The script finds four tables; I see 13 in the page source.

   Processing  https://www.pro-football-reference.com/teams/rav/2024.htm

Table 0,0:
   ,,,Tot Yds & TO,,,,,Passing,,,,,,,Rushing,,,,,Penalties,,,,,,Average Drive,,,,
   Player,PF,Yds,Ply,Y/P,TO,FL,1stD,Cmp,Att,Yds,TD,Int,NY/A,1stD,Att,Yds,TD,Y/A,1stD,Pen,Yds,1stPy,#Dr,Sc%,TO%,Start,Time,Plays,Yds,Pts
Table 0,1:
   ,,,,,,,,,,Score,,Offense,,,,,Defense,,,,,Expected Points,,
   Week,Day,Date,,,,OT,Rec,,Opp,Tm,Opp,1stD,TotYd,PassY,RushY,TO,1stD,TotYd,PassY,RushY,TO,Offense,Defense,Sp. Tms
Table 0,2:
   ,Downs,,,,,,Red Zone,,
   Player,3DAtt,3DConv,3D%,4DAtt,4DConv,4D%,RZAtt,RZTD,RZPct
Table 0,3:
   Rk,Player,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,1D,Succ%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,Sk%,NY/A,ANY/A,4QC,GWD,Awards

As you can see, four tables are found. The last one is the "Passing" table on the Pro Football Reference site. That does not match what I see in the page source when I view it in my browser.

I ran a version of this script, printing the downloaded html to a file. That file has 13 tables in it, including this:

<table class="per_match_toggle sortable stats_table" id="rushing_and_receiving" data-cols-to-freeze=",2"> <caption>Rushing & Receiving Table</caption>

That's the table I want! Then a "TR" with headers starting at line 2033 of the saved download. Then a "tbody" starting at line 2076, with "TD"s in it containing the data I want to pull.

How can I access that data in the script?

choroba · Accepted Answer · 2024-09-27 11:47:25Z


It seems the other tables are created by JavaScript; they aren't present in the HTML downloaded from the given URL. You can verify this by viewing the source of the page in a browser, or by using the following script:

use XML::LibXML;

# Your script goes here

my $dom = 'XML::LibXML'->load_html(string => $download, recover => 2) or die;
my @t = $dom->findnodes('//table');
print "Table tally:", scalar @t, "\n";  # 4

If you have Firefox, you can use Firefox::Marionette to let the browser run the JavaScript for you:

#!/usr/bin/perl
use warnings;
use strict;

use Firefox::Marionette;

my $team = 'rav';
my $html_string = 'https://www.pro-football-reference.com/teams/' . $team . '/2024.htm';

my $firefox = 'Firefox::Marionette'->new->go($html_string);
my $i = 0;
for my $table ($firefox->find_tag('table')) {
    ++$i;
}
print $i, "\n";

It says there are 53 tables. Now you can start parsing them by feeding $firefox->html to $te->parse().
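
As a rough sketch of that last step: once the rendered HTML is available, HTML::TableExtract's attribs constraint can select one table by its id instead of walking all 53. The helper name rows_for_table_id and the inline sample (including the made-up player row) are assumptions for illustration, not data from the site.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Return the rows (array refs of cell text) of the first table whose id
# attribute equals $id, or an empty list if no such table is found.
sub rows_for_table_id {
    my ($html, $id) = @_;
    my $te = HTML::TableExtract->new(attribs => { id => $id });
    $te->parse($html);
    $te->eof;
    my ($table) = $te->tables;
    return $table ? $table->rows : ();
}

# Inline sample standing in for the rendered page; the data row is invented.
my $sample = <<'HTML';
<table id="passing"><tr><th>Player</th><th>Yds</th></tr></table>
<table id="rushing_and_receiving">
  <tr><th>Player</th><th>Att</th><th>Yds</th></tr>
  <tr><td>Example Back</td><td>10</td><td>42</td></tr>
</table>
HTML

for my $row (rows_for_table_id($sample, 'rushing_and_receiving')) {
    # Empty cells come back undefined, so map them to empty strings.
    print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
}
```

With a live browser session, the same helper would presumably be called as rows_for_table_id($firefox->html, 'rushing_and_receiving'), assuming the rendered page keeps that id on the table.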

WWW::Mechanize::Chrome is another option, if you prefer Chrome to Firefox.

edited Sep 27, 2024 at 11:47

answered Sep 27, 2024 at 8:43


2 Comments

JimZipCode Over a year ago

Thanks for the input. The other tables ARE present in the HTML downloaded from the url. I see them in the page source in my browser; I also printed to a file the output downloaded from the "get" command in this code, and verified it's there too. There's this: <table class="per_match_toggle sortable stats_table" id="rushing_and_receiving" data-cols-to-freeze=",2"> <caption>Rushing & Receiving Table</caption> That's the table I want. Then a whole bunch of "TR"s starting at line 2033. Then a "tbody" starting at line 2076, with TDs in it containing the data I want to pull.

choroba Over a year ago

@JimZipCode: Try using syntax highlighting on the HTML. The tables are there, but they've been commented out. JavaScript probably removes the comments, but it might change the data as well. You might try parsing the comments and see.
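
One way to try the comment-parsing idea is sketched below. It assumes each hidden table sits entirely inside a single <!-- ... --> comment, and it strips the delimiters from only those comments before handing the text to HTML::TableExtract. The helper names (_unwrap_if_table, uncomment_tables, count_tables) are hypothetical, and the inline sample stands in for the downloaded page.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Unwrap one comment body if it holds a table; otherwise rebuild the comment.
sub _unwrap_if_table {
    my ($body) = @_;
    return $body =~ /<table\b/i ? $body : "<!--$body-->";
}

# Return a copy of $html in which every comment containing a <table> has had
# its <!-- and --> delimiters removed. Other comments are left untouched.
sub uncomment_tables {
    my ($html) = @_;
    $html =~ s{<!--(.*?)-->}{ _unwrap_if_table($1) }gse;
    return $html;
}

# Count the tables HTML::TableExtract can see in a string of HTML.
sub count_tables {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    $te->eof;
    my @tables = $te->tables;
    return scalar @tables;
}

# Inline sample: one visible table, one ordinary comment, one hidden table.
my $sample = <<'HTML';
<table id="passing"><tr><td>visible</td></tr></table>
<!-- ordinary comment -->
<!--
<table id="rushing_and_receiving"><tr><td>hidden</td></tr></table>
-->
HTML

printf "Tables seen before: %d, after: %d\n",
    count_tables($sample), count_tables(uncomment_tables($sample));
```

In the original script, this would amount to calling $te->parse(uncomment_tables($download)) in place of $te->parse($download). Whether the commented-out markup matches what JavaScript eventually renders is not something this sketch can confirm.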
