I want to extract data from several HTML pages, but I'm not familiar with HTML extraction. I have working code that reads the entire page source and then removes the unwanted parts with regexes; however, it seems to be quite slow.

I'm reading financial information and only want to extract a single number from each page, so I would rather not read the entire page each time if possible.

This is what I have in Perl:

use strict;
use warnings;
use LWP::Simple;

my $mult;
my $url = 'http://www.wikinvest.com/stock/Apple_(AAPL)/Data/Net_Income/2014/Q1';

my $content = get($url);
die "Could not fetch $url\n" unless defined $content; # get() returns undef on failure

$content =~ s/\R//g; # remove linebreaks
$content =~ s/.*\<div class="nv_lefty" id="nv_value">//; # remove everything up to and including the opening tag
$content =~ s/\<.*//g; # remove everything from the next tag onwards

if ($content =~ s/billion//) {$mult = 1e9;}
elsif ($content =~ s/million//) {$mult = 1e6;}
else {$mult = 1;}

$content =~ s/[^\d.-]//g; # keep digits, decimal points and minus signs only
$content = $content * $mult;

The get($url) call is quite slow, as it downloads a lot of data, whereas I'm only interested in a single number. Is there a faster way to do this? I looked into HTML::TableExtract, but I don't think the number I'm extracting is in a standard HTML table. I'm also not sure it would be any faster.

  • If the get($url) part is slow, then it's not your code; it's the external website's speed (HTTP request/response) that you are dependent on. Commented May 28, 2014 at 21:20
  • If you need to make many requests like this (i.e. fetch thousands of pages), then the only way to speed it up is probably to run multiple instances of your script, or multiple threads, so that more requests are in flight at the same time. Commented May 28, 2014 at 21:29
  • Do other financial websites have the data you are looking for in a better format? I'm thinking you might be able to get all the info you need in one CSV file from Yahoo. What is the specific number you are trying to retrieve? Commented May 28, 2014 at 22:52
  • The replacements are totally useless, since the data you are looking for is always in the same place: <div class="nv_lefty" id="nv_value">$10.22 billion</div>. You only need a DOM query, an XPath query, or a regex that matches this specific id. Commented May 29, 2014 at 0:04
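
The single-match approach suggested in the last comment could be sketched roughly as follows. This is only an illustration: parse_nv_value is a hypothetical helper name, the sample markup is the snippet quoted in that comment, and a regex tied to this exact div will break if the page layout changes. It uses only core Perl and does not make the download itself any faster; it just replaces the chain of substitutions with one capture.

```perl
use strict;
use warnings;

# Hypothetical helper: capture the text of the nv_value div and scale it.
sub parse_nv_value {
    my ($html) = @_;

    # Capture only the text inside the div whose id is "nv_value".
    my ($text) = $html =~ m{<div[^>]*\bid="nv_value"[^>]*>\s*([^<]*?)\s*</div>}s
        or return;

    # Translate the unit word, if any, into a multiplier.
    my %scale  = (billion => 1e9, million => 1e6);
    my ($unit) = $text =~ /\b(billion|million)\b/i;
    my $mult   = $unit ? $scale{ lc $unit } : 1;

    # Keep digits, decimal points and minus signs; commas and "$" are dropped.
    (my $number = $text) =~ s/[^\d.-]//g;
    return unless length $number;

    return $number * $mult;
}

# Sample markup taken from the comment above, standing in for a fetched page.
my $sample = '<div class="nv_lefty" id="nv_value">$10.22 billion</div>';
my $value  = parse_nv_value($sample);
print defined $value ? "$value\n" : "nv_value not found\n";
```

In a real script, the string returned by get($url) would be passed in place of $sample.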

1 Answer

Have a look at Web::Scraper rather than using regexes. It could save you a lot of time and will be less prone to errors.
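
A minimal sketch of how that might look, assuming Web::Scraper is installed from CPAN (it is not a core module) and that the value still lives in the element with id "nv_value" as shown in the question. The sample markup below is a stand-in for a downloaded page, not real site output.

```perl
use strict;
use warnings;
use Web::Scraper;

# Select the element by its id and keep only its text content.
my $scraper = scraper {
    process '#nv_value', value => 'TEXT';
};

# Sample markup standing in for a downloaded page (an assumption for illustration).
my $html = '<html><body><div class="nv_lefty" id="nv_value">$10.22 billion</div></body></html>';

# scrape() also accepts a URI object, e.g. URI->new($url), and fetches the page itself.
my $result = $scraper->scrape($html);

print defined $result->{value} ? "$result->{value}\n" : "nv_value not found\n";
```

The extracted string still needs the million/billion scaling from the question, and the full page is still downloaded either way, so this mainly helps with robustness rather than request time.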
