I am trying to extract data from an HTML table with Perl, using HTML::TableExtract. Specifically, I am trying to grab some rushing stats for the 2024 Baltimore Ravens from Pro Football Reference. The web page is here:

https://www.pro-football-reference.com/teams/rav/2024.htm

HTML::TableExtract finds four tables on that page. It finds:

  • Table 0,0: labelled "Team Stats and Rankings" on the web page
  • Table 0,1: labelled "Schedule & Game Results" on the web page
  • Table 0,2: labelled "Team Conversions" on the web page
  • Table 0,3: labelled "Passing" on the web page

And – that's it! There are at least 6 or 8 other tables on the page, including the one labelled "Rushing & Receiving" which is what I want. I see those tables in my browser window, and I see them in the page source when I view them. But HTML::TableExtract doesn't seem to notice them.

My code is below. I'm using the HTML::TableExtract->new() constructor with no attributes specified, to grab ALL tables on the web page.

use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;

my $team = 'rav';
my $html_string = 'https://www.pro-football-reference.com/teams/' . $team . '/2024.htm';
print "   Processing  $html_string\n";
print "\n";

my $download = get($html_string);
die "Could not download $html_string\n" unless defined $download;

my $te = HTML::TableExtract->new();
$te->parse($download);

foreach my $ts ($te->tables) {
   my ($depth, $count) = $ts->coords;
   print "Table $depth,$count:\n";

   # For output clarity, print only the header rows:
   # 2 rows for the first 3 tables, then 1 row for any subsequent tables.
   my $header_rows = $count > 2 ? 1 : 2;
   my $rowcount = 0;
   foreach my $row ($ts->rows) {
      last if $rowcount++ >= $header_rows;
      # Empty cells come back as undef; print them as empty strings.
      print "   ", join(',', map { $_ // '' } @$row), "\n";
   }
}

print "\n";

This is the output I get. The script finds four tables; I see 13 in the page source.

   Processing  https://www.pro-football-reference.com/teams/rav/2024.htm

Table 0,0:
   ,,,Tot Yds & TO,,,,,Passing,,,,,,,Rushing,,,,,Penalties,,,,,,Average Drive,,,,
   Player,PF,Yds,Ply,Y/P,TO,FL,1stD,Cmp,Att,Yds,TD,Int,NY/A,1stD,Att,Yds,TD,Y/A,1stD,Pen,Yds,1stPy,#Dr,Sc%,TO%,Start,Time,Plays,Yds,Pts
Table 0,1:
   ,,,,,,,,,,Score,,Offense,,,,,Defense,,,,,Expected Points,,
   Week,Day,Date,,,,OT,Rec,,Opp,Tm,Opp,1stD,TotYd,PassY,RushY,TO,1stD,TotYd,PassY,RushY,TO,Offense,Defense,Sp. Tms
Table 0,2:
   ,Downs,,,,,,Red Zone,,
   Player,3DAtt,3DConv,3D%,4DAtt,4DConv,4D%,RZAtt,RZTD,RZPct
Table 0,3:
   Rk,Player,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,1D,Succ%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,Sk%,NY/A,ANY/A,4QC,GWD,Awards

As you can see, four tables are found. The last one is the "Passing" table on the Pro Football Reference site. That does not match what I see in the page source when I view it in my browser.

I ran a version of this script, printing the downloaded html to a file. That file has 13 tables in it, including this:

<table class="per_match_toggle sortable stats_table" id="rushing_and_receiving" data-cols-to-freeze=",2"> <caption>Rushing &amp; Receiving Table</caption>

That's the table I want! Then a "TR" with headers starting at line 2033 of the saved download. Then a "tbody" starting at line 2076, with "TD"s in it containing the data I want to pull.

How can I access that data in the script?

1 Answer

It seems the other tables are created by JavaScript; they aren't present as table elements in the HTML downloaded from the given URL. You can verify this by viewing the source of the page in a browser, or by using the following script:

use XML::LibXML;

# Your script goes here

my $dom = 'XML::LibXML'->load_html(string => $download, recover => 2) or die;
my @t = $dom->findnodes('//table');
print "Table tally: ", scalar @t, "\n";  # 4

If you have Firefox, you can use Firefox::Marionette to let the browser run the JavaScript for you:

#!/usr/bin/perl
use warnings;
use strict;

use Firefox::Marionette;

my $team = 'rav';
my $html_string = 'https://www.pro-football-reference.com/teams/' . $team . '/2024.htm';

my $firefox = 'Firefox::Marionette'->new->go($html_string);
my $i = 0;
for my $table ($firefox->find_tag('table')) {
    ++$i;
}
print $i, "\n";

It says there are 53 tables. Now you can start parsing them by feeding $firefox->html to $te->parse().

WWW::Mechanize::Chrome is another option, if you prefer Chrome to Firefox.

2 Comments

Thanks for the input. The other tables ARE present in the HTML downloaded from the url. I see them in the page source in my browser; I also printed to a file the output downloaded from the "get" command in this code, and verified it's there too. There's this: <table class="per_match_toggle sortable stats_table" id="rushing_and_receiving" data-cols-to-freeze=",2"> <caption>Rushing &amp; Receiving Table</caption> That's the table I want. Then a whole bunch of "TR"s starting at line 2033. Then a "tbody" starting at line 2076, with TDs in it containing the data I want to pull.
@JimZipCode: Try using syntax highlighting on the HTML. The tables are there, but they've been commented out. JavaScript probably removes the comments, but it might change the data as well. You might try parsing the comments and see.
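
As a rough sketch of the "parse the comments" idea, the snippet below unwraps every HTML comment that contains a table, so a parser can see the markup. The helper names uncomment_tables and count_live_tables and the sample HTML are made up for illustration; none of it comes from HTML::TableExtract or from the real page, and it assumes the hidden tables are wrapped in ordinary <!-- ... --> comments.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the <!-- --> wrapper from every HTML comment
# that contains a <table>, leaving all other comments untouched.
sub uncomment_tables {
    my ($html) = @_;
    $html =~ s{<!--(.*?)-->}{ index(lc $1, '<table') >= 0 ? $1 : "<!--$1-->" }gse;
    return $html;
}

# Hypothetical helper: count <table> tags that are not inside a comment.
sub count_live_tables {
    my ($html) = @_;
    $html =~ s{<!--.*?-->}{}gs;    # drop whatever is still commented out
    my $n = () = $html =~ /<table\b/gi;
    return $n;
}

# Tiny made-up sample standing in for the downloaded page.
my $sample = join "\n",
    '<table id="passing"><tr><td>visible</td></tr></table>',
    '<!-- just a note -->',
    '<!--',
    '<table id="rushing_and_receiving"><tr><td>hidden</td></tr></table>',
    '-->';

my $fixed = uncomment_tables($sample);

print "Tables visible before: ", count_live_tables($sample), "\n";
print "Tables visible after:  ", count_live_tables($fixed), "\n";
```

In the question's script, the same idea would mean calling $te->parse(uncomment_tables($download)) instead of $te->parse($download). Whether the commented-out data matches what the JavaScript eventually displays is something to check against the live page.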
