Save and render a webpage with PhantomJS and node.js

Question

I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be a simple example with an obvious use case for PhantomJS, but I can't find a decent one. The documentation seems to be all about command-line use.

@DeclanCook Server-side, I think? Client-side would require the user to install phantom, right? That wouldn't work, if I understand correctly. Thanks
– Harry, Commented Apr 2, 2012 at 13:07
What are you attempting to do with the HTML once you have it? I'm trying to get my head around what you are trying to achieve. PhantomJS has DOM manipulation, see code.google.com/p/phantomjs/wiki/QuickStart#DOM_Manipulation. Are you then going to send this somewhere?
– Declan Cook, Commented Apr 2, 2012 at 13:18
@DeclanCook The use case is creating a cached static HTML copy of a JavaScript app view for search engines. I want to be able to programmatically run through my sitemap and save an HTML version of every link.
– Harry, Commented Apr 2, 2012 at 14:12
@DeclanCook Yeah, that linked page is the sort of thing I need. I would just like an example of how to do it from node. Thanks
– Harry, Commented Apr 2, 2012 at 14:13

Amir Raminfar · Accepted Answer · 2019-04-01 23:17:41Z

45

From your comments, I'd guess you have two options:

1. Try a phantomjs node module - https://github.com/amir20/phantomjs-node
2. Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html

Edit:

It seems phantomjs itself suggests the child process approach as the way to interact with node; see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ
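A minimal sketch of the child process route from node, under some assumptions: `phantomjs` is on the PATH, and the PhantomJS script (here a hypothetical `render.js`) reads the URL from `require('system').args[1]` and prints the HTML to stdout. The helper names are illustrative, not part of any library.

```javascript
const { execFile } = require('child_process');

// Build the argument list for the phantomjs binary: the script comes first,
// then whatever the script reads from its own arguments (here, the URL).
function buildPhantomArgs(scriptPath, url) {
  return [scriptPath, url];
}

// Run a command and resolve with everything it wrote to stdout.
function run(command, args) {
  return new Promise((resolve, reject) => {
    // maxBuffer is raised because rendered HTML can exceed the default limit.
    execFile(command, args, { maxBuffer: 50 * 1024 * 1024 }, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}

// Assumption: `phantomjs` is on the PATH and `scriptPath` prints HTML to stdout.
function renderWithPhantom(scriptPath, url) {
  return run('phantomjs', buildPhantomArgs(scriptPath, url));
}

// Example (not executed here):
// renderWithPhantom('render.js', 'http://example.com').then(console.log);
```

`execFile` buffers the whole output before calling back, which is convenient for a single page; for streaming or very large output, `spawn` is the usual alternative.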

Edit:

Example PhantomJS script for getting the page's HTML markup:

var page = require('webpage').create();
page.open('http://www.google.com', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // Runs inside the page, after its scripts have modified the DOM
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        console.log(p);
    }
    // Always exit, otherwise the phantomjs process keeps running
    phantom.exit();
});

edited Apr 1, 2019 at 23:17

Amir Raminfar

answered Apr 2, 2012 at 14:20

Declan Cook

6 Comments

Harry Over a year ago

Can you show an example? Grab a page, run javascript, get html?

JLarky Over a year ago

You can simply use 'page.content', there's no need to evaluate anything.

Adam Waite Over a year ago

This is great, but I'm struggling to use require('webpage') in that script when it is wrapped with node, because the webpage module is undefined in node (it only exists in phantom). Does anyone have any ideas? Is 'webpage' a module common to both node and phantom, or can I somehow use require in the phantom context only?

Josh C. Over a year ago

@AdamWaite the evaluation is "sandboxed" and can't execute the require. You would have to pass everything in a closure to the evaluate().

Josh C. Over a year ago

Has anyone been able to run two child processes making phantomjs calls concurrently?

Amir Raminfar · Accepted Answer · 2019-04-01 23:20:35Z

8

With v2 of phantomjs-node it's pretty easy to print the HTML after it has been processed.

var phantom = require('phantom');

phantom.create().then(function(ph) {
  ph.createPage().then(function(page) {
    page.open('https://stackoverflow.com/').then(function(status) {
      console.log(status);
      page.property('content').then(function(content) {
        console.log(content);
        page.close();
        ph.exit();
      });
    });
  });
});

This will show the output as it would have been rendered with the browser.
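To save that output instead of printing it, something like the following sketch could be used. The file-naming scheme and both helper names are assumptions made for illustration; only node's built-in `fs`, `path` and `URL` are used.

```javascript
const fs = require('fs');
const path = require('path');

// Turn a URL into a flat, file-system-safe name,
// e.g. "https://example.com/a/b?x=1" becomes "example.com_a_b_x_1.html".
function snapshotFileName(pageUrl) {
  const { host, pathname, search } = new URL(pageUrl);
  const slug = (host + pathname + search)
    .replace(/[^a-zA-Z0-9.-]+/g, '_') // collapse runs of unsafe characters
    .replace(/^_+|_+$/g, '');         // trim leading/trailing underscores
  return slug + '.html';
}

// Write the rendered HTML into outDir and return the path of the new file.
function saveSnapshot(outDir, pageUrl, html) {
  fs.mkdirSync(outDir, { recursive: true });
  const file = path.join(outDir, snapshotFileName(pageUrl));
  fs.writeFileSync(file, html, 'utf8');
  return file;
}

// Usage inside the callback above, in place of console.log(content):
// saveSnapshot('snapshots', 'https://stackoverflow.com/', content);
```

Looping over the URLs of a sitemap and calling `saveSnapshot` for each would give one static HTML file per page.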

Edit 2019:

You can use async/await:

const phantom = require('phantom');

(async function() {
  const instance = await phantom.create();
  const page = await instance.createPage();
  await page.on('onResourceRequested', function(requestData) {
    console.info('Requesting', requestData.url);
  });

  const status = await page.open('https://stackoverflow.com/');
  const content = await page.property('content');
  console.log(content);

  await instance.exit();
})();

Or if you just want to test, you can use npx

npx phantom@latest https://stackoverflow.com/
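None of the snippets above waits explicitly for client-side rendering to finish, which the question asks about. One possible approach is to poll the page until some condition holds before reading the content. The sketch below is a generic polling helper; `waitFor` and the `#app-ready` selector are hypothetical names, and the usage comment assumes phantomjs-node's promise-returning `page.evaluate`.

```javascript
// Poll `check` until it returns a truthy value or the timeout elapses.
// `check` may be synchronous or return a promise.
async function waitFor(check, { timeout = 10000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  for (;;) {
    const result = await check();
    if (result) return result;
    if (Date.now() >= deadline) {
      throw new Error('waitFor: timed out after ' + timeout + ' ms');
    }
    // Sleep before the next attempt.
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}

// Usage with phantomjs-node (assumes the app adds a hypothetical
// "#app-ready" element once it has finished rendering):
//
//   await page.open(url);
//   await waitFor(() => page.evaluate(function () {
//     return !!document.querySelector('#app-ready');
//   }));
//   const content = await page.property('content');
```

What counts as "rendered" depends on the app, so the condition passed to `waitFor` has to be chosen per site.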

edited Apr 1, 2019 at 23:20

answered Mar 15, 2016 at 18:26

Amir Raminfar

1 Comment

Yuriy Kravets Over a year ago

Does it allow rendering HTML given as a string?

ultrageek · Accepted Answer · 2012-05-31 20:21:08Z

I've used two different approaches in the past. One is the page.evaluate() method that Declan mentioned, which queries the DOM. The other is to have the web page write the information to console.log() and pick it up in the phantomjs script with:

page.onConsoleMessage = function (msg, line, source) {
  console.log('console [' + source + ':' + line + ']> ' + msg);
};

I might also trap the variable msg in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.

Then, in the Node.js script, you would have to scan the output of the PhantomJS script:

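var spawn = require('child_process').spawn;

// args: array of phantomjs command-line arguments (script path first)
// callback: function (err, output), called once when phantomjs has finished
var yourfunc = function (args, callback) {
  var phantom = spawn('phantomjs', args);
  var str_phantom_output = '';
  var finished = false;
  function done(err, output) {
    if (finished) return; // 'error' and 'close' can both fire
    finished = true;
    callback(err, output);
  }
  phantom.stdout.setEncoding('utf8');
  phantom.stdout.on('data', function (data) {
    // This gets triggered one or more times, so collect the chunks here and
    // parse for whatever info you're expecting once the output is complete
    str_phantom_output += data;
  });
  phantom.stderr.on('data', function (data) {
    // do something with error data
  });
  phantom.on('error', function (err) {
    // spawn itself failed, e.g. phantomjs is not installed
    done(err);
  });
  // 'close' fires after the stdio streams have ended, so no output is lost
  phantom.on('close', function (code) {
    if (code !== 0) {
      done(new Error('phantomjs exited with code ' + code));
    } else {
      // clean exit: hand the collected output to the passed-in callback
      done(null, str_phantom_output);
    }
  });
};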
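Because the 'data' handler can fire several times, a payload may arrive split across chunks. One way to handle that, sketched here under the assumption that the PhantomJS script prints a pair of markers around the payload, is to buffer the chunks and extract the text between the markers. The marker strings and `createOutputParser` are hypothetical names chosen for this example.

```javascript
// Hypothetical markers; the PhantomJS script would print them around the payload.
const BEGIN = '<<<PHANTOM_BEGIN>>>';
const END = '<<<PHANTOM_END>>>';

// Returns a function that accepts stdout chunks and returns the text between
// the markers, or null while the closing marker has not arrived yet.
function createOutputParser() {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const start = buffer.indexOf(BEGIN);
    if (start === -1) return null;
    const end = buffer.indexOf(END, start + BEGIN.length);
    if (end === -1) return null;
    return buffer.slice(start + BEGIN.length, end);
  };
}

// Usage inside a stdout 'data' handler:
//   var push = createOutputParser();
//   phantom.stdout.on('data', function (data) {
//     var payload = push(data);
//     if (payload !== null) { /* complete payload received */ }
//   });
```

Markers also make it easy to ignore unrelated console noise that the page itself may log.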

Hope that helps some.

yossi · Accepted Answer · 2013-12-18 16:07:38Z

3

Why not just use this?

var page = require('webpage').create();
page.open("http://example.com", function (status)
{
    if (status !== 'success')
    {
        console.log('FAIL to load the address');
    }
    else
    {
        console.log('Success in fetching the page');
        console.log(page.content);
    }
    phantom.exit();
});

answered Dec 18, 2013 at 16:07

yossi

Comments

Stilltorik · Accepted Answer · 2013-06-26 16:10:54Z

1

Late update in case anyone stumbles on this question:

A colleague of mine has developed a project on GitHub that aims to help you do exactly that: https://github.com/vmeurisse/phantomCrawl.

It's still a bit young and is certainly missing some documentation, but the example provided should help with basic crawling.

answered Jun 26, 2013 at 16:10

Stilltorik

Comments

user2950147 · Accepted Answer · 2014-04-26 03:18:32Z

1

Here's an old project I use that runs node, express and phantomjs and saves the page as a .png. You could tweak it fairly quickly to get the HTML instead.

https://github.com/wehrhaus/sitescrape.git

answered Apr 26, 2014 at 3:18

user2950147

2 Comments

Rob Watts Over a year ago

FYI, if you are going to use a link to provide an answer, it's best to include enough information that your answer won't become useless if the link happens to break at some point in the future.

Toolkit Over a year ago

to save as png you just do page.render('file.png')
