62

I'm looking for an example of requesting a webpage, waiting for the JavaScript to render (JavaScript modifies the DOM), and then grabbing the HTML of the page.

This should be a simple example with an obvious use case for PhantomJS. I can't find a decent example; the documentation seems to be all about command-line use.

  • Are you looking to do this client side or server side? Commented Apr 2, 2012 at 12:40
  • @DeclanCook Server side, I think? Client side would require the user to install phantom, right? Which wouldn't work, if I understand correctly. Thanks. Commented Apr 2, 2012 at 13:07
  • What are you attempting to do with the HTML once you have it? I'm trying to get my head around what you want to achieve. Phantomjs has DOM manipulation, see code.google.com/p/phantomjs/wiki/QuickStart#DOM_Manipulation. Are you then going to send this somewhere? Commented Apr 2, 2012 at 13:18
  • @DeclanCook The use case is creating a cached static HTML copy of a JavaScript app view for search engines. I want to be able to programmatically run through my sitemap and save an HTML version of every link. Commented Apr 2, 2012 at 14:12
  • @DeclanCook Yeah, that linked page is the sort of thing I need; I would just like an example of how to do it from node. Thanks. Commented Apr 2, 2012 at 14:13

6 Answers

45

From your comments, I'd guess you have two options:

  1. Try to find a phantomjs node module - https://github.com/amir20/phantomjs-node
  2. Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html

Edit:

It seems phantomjs itself suggests a child process as the way to interact with node; see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ

Edit:

Example Phantomjs script for getting the page's HTML markup:

var page = require('webpage').create();
page.open('http://www.google.com', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // Runs inside the page context and returns the markup of the rendered DOM.
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        console.log(p);
    }
    // Exit in both branches, otherwise the phantomjs process keeps running.
    phantom.exit();
});
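
For option 2, here is a minimal sketch of the Node side. It assumes phantomjs is on your PATH and that the script above has been saved as render.js (a hypothetical name) and changed to read the URL from require('system').args[1] instead of hard-coding it. buildArgs and renderPage are illustrative names, not part of any library.

```javascript
const { execFile } = require('child_process');

// Assumption: a PhantomJS script saved as render.js that takes the URL from
// require('system').args[1] and prints the page markup to stdout.
const DEFAULT_SCRIPT = 'render.js';

// Builds the argument list for `phantomjs render.js <url>`.
function buildArgs(url, script) {
  return [script || DEFAULT_SCRIPT, url];
}

// Runs PhantomJS as a child process and resolves with whatever it printed.
function renderPage(url, options) {
  const opts = options || {};
  const binary = opts.binary || 'phantomjs'; // assumes phantomjs is on PATH
  return new Promise(function (resolve, reject) {
    execFile(
      binary,
      buildArgs(url, opts.script),
      { maxBuffer: 10 * 1024 * 1024 }, // rendered pages can exceed the default buffer
      function (err, stdout) {
        if (err) {
          reject(err);
          return;
        }
        resolve(stdout);
      }
    );
  });
}
```

Something like renderPage('http://example.com/').then(function (html) { /* write html to disk */ }) would then hand you the markup, which fits the static-copy use case from the comments. Each call starts its own phantomjs process.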

6 Comments

Can you show an example? Grab a page, run javascript, get html?
You can simply use 'page.content', there's no need to evaluate anything.
This is great, but I'm struggling to use require('webpage') in that script when it is wrapped with node, because the webpage module is undefined in node; it only exists in phantom. Does anyone have any ideas? Is 'webpage' a module common to both node and phantom, or can I somehow use require in the phantom context only?
@AdamWaite the evaluation is "sandboxed" and can't execute the require. You would have to pass everything in a closure to the evaluate().
Has anyone been able to run two child processes making phantomjs calls concurrently?
8

With v2 of phantomjs-node it's pretty easy to print the HTML after it has been processed.

var phantom = require('phantom');

phantom.create().then(function(ph) {
  ph.createPage().then(function(page) {
    page.open('https://stackoverflow.com/').then(function(status) {
      console.log(status);
      page.property('content').then(function(content) {
        console.log(content);
        page.close();
        ph.exit();
      });
    });
  });
});

This will show the output as it would have been rendered with the browser.

Edit 2019:

You can use async/await:

const phantom = require('phantom');

(async function() {
  const instance = await phantom.create();
  const page = await instance.createPage();
  await page.on('onResourceRequested', function(requestData) {
    console.info('Requesting', requestData.url);
  });

  const status = await page.open('https://stackoverflow.com/');
  const content = await page.property('content');
  console.log(content);

  await instance.exit();
})();

Or, if you just want to test, you can use npx:

npx phantom@latest https://stackoverflow.com/

1 Comment

Does it allow rendering HTML from a given string?
4

I've used two different ways in the past, including the page.evaluate() method that queries the DOM that Declan mentioned. The other way I've passed info from the web page is to spit it out to console.log() from there, and in the phantomjs script use:

page.onConsoleMessage = function (msg, line, source) {
  console.log('console [' +source +':' +line +']> ' +msg);
}

I might also trap the variable msg in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.

Then in the Nodejs script, you would have to scan the output of the Phantomjs script:

var spawn = require('child_process').spawn;

var yourfunc = function(...params...) {
  var phantom = spawn('phantomjs', [...args]);
  phantom.stdout.setEncoding('utf8');
  phantom.stdout.on('data', function(data) {
    //parse or echo data
    var str_phantom_output = data.toString();
    // The above will get triggered one or more times, so you'll need to
    // add code to parse for whatever info you're expecting from the browser
  });
  phantom.stderr.on('data', function(data) {
    // do something with error data
  });
  phantom.on('exit', function(code) {
    if (code !== 0) {
      // console.log('phantomjs exited with code ' +code);
    } else {
      // clean exit: do something else such as a passed-in callback
    }
  });
}
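
One way to cope with the 'data' handler firing several times is to have the PhantomJS script print its payload between two marker strings and to buffer on the Node side until a complete block has arrived. A rough sketch of that idea follows. The markers @@PAYLOAD_START@@ and @@PAYLOAD_END@@ and the name createPayloadCollector are made up for this example, and the PhantomJS script would have to print the same markers.

```javascript
// Hypothetical markers; the PhantomJS script must print the same strings
// around its payload for this to work.
const START = '@@PAYLOAD_START@@';
const END = '@@PAYLOAD_END@@';

// Returns a function to use as the stdout 'data' handler. It buffers chunks
// and calls onPayload once for each complete START...END block.
function createPayloadCollector(onPayload) {
  let buffer = '';
  return function onData(chunk) {
    buffer += chunk.toString();
    for (;;) {
      const start = buffer.indexOf(START);
      if (start === -1) {
        // No marker yet: keep only a tail that could be a partial START marker.
        buffer = buffer.slice(-(START.length - 1));
        return;
      }
      const end = buffer.indexOf(END, start + START.length);
      if (end === -1) {
        // Payload is still incomplete: drop leading noise and wait for more data.
        buffer = buffer.slice(start);
        return;
      }
      onPayload(buffer.slice(start + START.length, end));
      buffer = buffer.slice(end + END.length);
    }
  };
}
```

It would be wired up as phantom.stdout.on('data', createPayloadCollector(function (html) { /* use html */ })); anything printed outside the markers is ignored.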

Hope that helps some.

3

Why not just use this?

var page = require('webpage').create();
page.open("http://example.com", function (status)
{
    if (status !== 'success') 
    {
        console.log('FAIL to load the address');            
    } 
    else 
    {
        console.log('Success in fetching the page');
        console.log(page.content);
    }
    phantom.exit();
});
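
Note that the open callback fires once the page has loaded, which may be before script-driven DOM changes (for example content fetched via AJAX) have finished. A minimal sketch of polling for a "rendered" condition before reading page.content is below. It is modelled loosely on the waitfor.js example that ships with PhantomJS; the element id app-ready is a hypothetical signal your own app would have to provide, and the default timeout and interval are arbitrary.

```javascript
// Polls check() until it returns true or timeoutMs elapses.
// Plain ES5 so it can be pasted into a PhantomJS script.
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
  var deadline = Date.now() + (timeoutMs || 5000);
  // If the condition already holds, there is nothing to wait for.
  if (check()) {
    onReady();
    return;
  }
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() >= deadline) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs || 100);
}

// Usage inside the success branch of page.open (assumption: the app adds an
// element with the hypothetical id "app-ready" when it has finished rendering):
//
//   waitFor(
//     function () {
//       return page.evaluate(function () {
//         return document.getElementById('app-ready') !== null;
//       });
//     },
//     function () { console.log(page.content); phantom.exit(); },
//     function () { console.log('Timed out waiting for render'); phantom.exit(1); }
//   );
```

With this approach phantom.exit() has to move into the two callbacks; calling it right after page.open's if/else would end the process before the polling gets a chance to run.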

1

Late update in case anyone stumbles on this question:

A colleague of mine developed a project on GitHub that aims to help with exactly that: https://github.com/vmeurisse/phantomCrawl.

It is still a bit young and is certainly missing some documentation, but the example provided should help with basic crawling.

1

Here's an old version that I use running node, express and phantomjs which saves out the page as a .png. You could tweak it fairly quickly to get the html.

https://github.com/wehrhaus/sitescrape.git

2 Comments

FYI, if you are going to use a link to provide an answer, it's best to include enough information that your answer won't become useless if the link happens to break at some point in the future.
To save as a PNG, you just call page.render('file.png').
