0

I would like to scrape web pages that load their content dynamically with JavaScript or similar.

I am looking for something like a headless browser that I can run on a Linux shared host without X.

I can use PHP, Perl, Ruby or Python.

Does anyone know of a framework or headless browser that could help?

Thank you very much.

2
  • 1
    possible duplicate of headless internet browser? Commented Jul 20, 2012 at 15:56
  • 1
    Is there any reason you can't get an inexpensive VPS and install whatever you want on it? Shared hosting is usually a terrible place to run intensive operations like this. Commented Jul 20, 2012 at 16:07

3 Answers

1

Try Selenium to control the browser if you need to simulate key presses or clicks in order to get the content to load.

For a headless browser, several are listed in the question "headless internet browser?".
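
As a rough illustration of the Selenium approach, here is a minimal sketch in Python (one of the languages the question allows). It assumes the `selenium` package and a Chrome or Chromium binary are installed on the host, which a shared host may not permit. The function name `fetch_rendered_html`, the `HEADLESS_FLAGS` list, and the URL and selector in the usage comment are made up for this example.

```python
"""Sketch: render a JavaScript-driven page with Selenium and headless Chrome."""

# Flags for running Chrome without a display (no X server needed).
HEADLESS_FLAGS = ["--headless", "--disable-gpu", "--window-size=1280,1024"]


def fetch_rendered_html(url, wait_css, timeout=10):
    """Load url, wait for an element matching wait_css, return the rendered HTML."""
    # Imported lazily so this module still loads where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for flag in HEADLESS_FLAGS:
        options.add_argument(flag)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the script-generated element exists (or time out).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()


# Usage (hypothetical URL and selector):
#   html = fetch_rendered_html("https://example.com/", "h1")
```

The explicit wait matters here: `page_source` is read only after the element that the page's scripts create has appeared, so the returned HTML reflects the dynamically loaded content rather than the initial response.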


1

See the Perl library WWW::Scripter.

Synopsis:

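use strict;
use warnings;
use WWW::Scripter;

my $w = WWW::Scripter->new;
$w->use_plugin('JavaScript');   # the plugin name is case-sensitive; packaged separately
$w->get('http://some.site.com/that/uses/javascript');
my $html = $w->content;         # the HTML content, possibly modified by scripts
$w->eval('alert("Hello from JavaScript")');
# DOM access continues from here (elided in the original synopsis):
# $w->document->getElementsByTagName('div')->[0]->...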


-2

You can use WWW::Mechanize in Perl. The module has numerous methods that perform browser-like functions. Sample code:

use strict;
use warnings;
use WWW::Mechanize;

# Placeholder credentials and paths; replace with your own.
my $username = "admin";
my $password = "welcome1";
my $outpath  = "/home/data/output";
my $fromday  = 7;    # not used below
my $url      = "https://www.myreports.com/tax_report.php";
my $name     = "tax_report";
my $outfile  = "$outpath/$name.html";

my $mech = WWW::Mechanize->new(noproxy => 0);

# Fetch the login page and fill in the form fields.
$mech->get($url);
$mech->field(login  => $username);
$mech->field(passwd => $password);

# Dump each request and response for debugging.
$mech->add_handler("request_send",  sub { shift->dump; return });
$mech->add_handler("response_done", sub { shift->dump; return });

# Submit the form; the resulting page becomes the current content.
$mech->click_button(value => "Login now");

my $response = $mech->content();

print "Generating report: $name...\n";

# Append the page to the report file.
open(my $out, '>>', $outfile) or die "Cannot open report file $outfile: $!";
print {$out} $response;
close $out;

WWW::Mechanize itself does not execute JavaScript. If the page you want to scrape depends on scripts, have a look at WWW::Mechanize::Firefox, which drives a running Firefox and requires the MozRepl add-on to be installed.
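
To see why a plain HTTP client such as WWW::Mechanize is not enough for script-driven pages, here is a minimal sketch in Python (one of the languages the question allows) using only the standard library. The `PAGE` markup, the `report` id, and the helper names are invented for this illustration. It parses the HTML as a non-JavaScript client would receive it and shows that the element the script fills in is still empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the report text is written by a script at load time.
PAGE = """
<html><body>
<div id="report"></div>
<script>
  document.getElementById("report").textContent = "Total: 42";
</script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside <div id="report">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "report":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script source arrives here too, but only text inside the div is kept.
        if self.inside:
            self.text.append(data)


def static_report_text(html):
    """Return the div's text as seen without running any JavaScript."""
    parser = DivText()
    parser.feed(html)
    return "".join(parser.text).strip()


print(repr(static_report_text(PAGE)))  # prints ''
```

The value appears only as source text inside the `<script>` element. Nothing has run the script, so the `div` stays empty. This is the gap a real or headless browser fills.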

1 Comment

WWW::Mechanize::Firefox is not a headless browser; it very much requires X.
