Perl fast HTML extract

Question

I want to extract data from several HTML pages, but I'm not familiar with HTML extraction. I have working code that reads the entire page source and then removes the unwanted parts with regexes; however, it seems to be quite slow.

I'm reading financial information and only want to extract a single number from each page, so I'd rather not read the entire page each time if possible.

This is what I have in Perl:

use strict;
use warnings;
use LWP::Simple;

my $mult;
my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';

my $content = get($url);
die "Could not fetch $url\n" unless defined $content;

$content =~ s/\R//g; # remove linebreaks
$content =~ s/.*<div class="nv_lefty" id="nv_value">//; # remove everything up to and including the opening tag
$content =~ s/<.*//; # remove everything from the next tag onward

if ($content =~ s/billion//) {$mult = 1e9;}
elsif ($content =~ s/million//) {$mult = 1e6;}
else {$mult = 1;}

$content =~ s/[^\d.-]//g; # keep only digits, decimal points and minus signs
$content = $content * $mult;

The get($url) call is quite slow because it downloads the whole page, whereas I'm only interested in a single number. Is there a faster way to do this? I looked into HTML::TableExtract, but I don't think the number I'm extracting is in a standard HTML table. I'm also not sure it would be any faster.

If the get($url) part is slow, then it's not your code; it's the speed of the external website (the HTTP request/response) that you are dependent on.
– Michal Gasek, Commented May 28, 2014 at 21:20
If you need to do many requests like this (i.e. fetch thousands of pages), then the only way to speed it up is probably to run multiple instances of your script, or threads, so that more processes are doing requests at the same time.
– Michal Gasek, Commented May 28, 2014 at 21:29
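One way the "several processes at once" idea could look, as a rough sketch using only core Perl (fork and pipe). The name run_parallel, the worker callback, and the limit of 4 processes are assumptions made up for this illustration; nothing here comes from the thread itself. The demo uses a stand-in worker so it runs without network access.

```perl
use strict;
use warnings;
use POSIX ();

# Run $worker->($item) for every item, with at most $max_procs child
# processes alive at once. Returns a hash mapping item => worker result
# (as a string). run_parallel is a hypothetical helper, not a library call.
sub run_parallel {
    my ($worker, $max_procs, @items) = @_;
    my (%result, @pending);    # @pending holds [ pid, item, read handle ]

    # Read everything the oldest child wrote, then reap that child.
    my $collect_oldest = sub {
        my ($pid, $item, $reader) = @{ shift @pending };
        my $out = do { local $/; <$reader> };   # slurp until the child closes its end
        close $reader;
        waitpid($pid, 0);
        $result{$item} = defined $out ? $out : '';
    };

    for my $item (@items) {
        $collect_oldest->() while @pending >= $max_procs;

        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {                        # child process
            close $reader;
            my $out = $worker->($item);
            print {$writer} defined $out ? $out : '';
            close $writer;                      # flushes the output
            POSIX::_exit(0);                    # skip the parent's END blocks
        }

        close $writer;                          # parent keeps only the read end
        push @pending, [ $pid, $item, $reader ];
    }

    $collect_oldest->() while @pending;
    return %result;
}

# Stand-in worker so the sketch runs offline; a real one might call
# LWP::Simple::get($_[0]) and return the extracted number.
my %doubled = run_parallel(sub { $_[0] * 2 }, 4, 1 .. 5);
print "$_ => $doubled{$_}\n" for sort { $a <=> $b } keys %doubled;
```

Each child only sends back a string, so it makes sense to do the extraction inside the worker and return just the number. Whether this actually helps depends on how many simultaneous requests the remote site tolerates.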
Do other financial websites have the data you are looking for in a better format? I'm thinking you might be able to get all the info you are looking for in one CSV file from Yahoo. What is the specific number you are trying to retrieve?
– bf2020, Commented May 28, 2014 at 22:52
The replacements are totally useless, since the data you are looking for is always in the same place: <div class="nv_lefty" id="nv_value">$10.22 billion</div>. You only need a DOM query, an XPath query, or a regex that matches this specific id.
– Casimir et Hippolyte, Commented May 29, 2014 at 0:04
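A minimal sketch of the "regex that matches this specific id" route, assuming the markup looks like the div quoted in the comment above. The function name parse_nv_value is made up for this example, and the sample string stands in for the downloaded page.

```perl
use strict;
use warnings;

# Capture the text inside the element with id="nv_value" in a single match,
# then scale it by the "million"/"billion" suffix.
# parse_nv_value is a hypothetical helper name.
sub parse_nv_value {
    my ($html) = @_;

    # Text between the opening tag carrying id="nv_value" and the next tag.
    my ($text) = $html =~ m{<div[^>]*\bid="nv_value"[^>]*>\s*([^<]*)}
        or return;

    my $mult = $text =~ /billion/i ? 1e9
             : $text =~ /million/i ? 1e6
             :                       1;

    (my $num = $text) =~ s/[^\d.-]//g;   # keep digits, decimal point, minus sign
    return unless length $num;
    return $num * $mult;
}

# Sample markup taken from the comment; in practice $html would be the
# string returned by get($url).
my $sample = '<div class="nv_lefty" id="nv_value">$10.22 billion</div>';
print parse_nv_value($sample), "\n";
```

This replaces the chain of substitutions with one match, but it does not make the download itself any faster, since the whole page is still fetched.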

oalders · Accepted Answer · 2014-05-29 02:07:28Z


Have a look at Web::Scraper rather than using regexes. It could save you a lot of time and will be less prone to errors.
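For illustration only (this is not the answerer's code), a Web::Scraper version might look roughly like the sketch below. It assumes the Web::Scraper module from CPAN is installed and that the value sits in the element with id nv_value, as quoted in the comments. The key name value and the inline sample HTML are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Web::Scraper;

# Describe what to extract: the text of the element whose id is nv_value.
# The result key "value" is an arbitrary name picked for this sketch.
my $value_scraper = scraper {
    process '#nv_value', value => 'TEXT';
};

# Sample markup so the sketch runs offline. To scrape the live page,
# pass a URI object instead: $value_scraper->scrape(URI->new($url)).
my $html = '<html><body><div class="nv_lefty" id="nv_value">$10.22 billion</div></body></html>';

my $result = $value_scraper->scrape($html);
print $result->{value}, "\n";
```

The million/billion scaling from the question would still need to be applied to the extracted text afterwards.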
