7

I need a command-line tool (or JavaScript/PHP, but I think the command line is the way to go) to render a URL and get the rendered content. The important part is that it must execute the JavaScript, not only load the CSS/HTML/images.

For example, a command like "renderengine http://www.google.es outputfile.html", where the content of the page (parsed HTML with the JavaScript executed) is saved to outputfile.html.

I need this because I have to capture the result of a fully JavaScript-driven website like Grooveshark. The site loads everything using JavaScript/AJAX, so crawlers find nothing but an empty basic HTML template (the content is loaded afterwards via AJAX/JavaScript).

Is there a browser engine for Linux with JavaScript support (for example V8) that can output the result so it can be saved to a file?

1
  • 1
    I'm not sure 'render' is the word I would use here if you want to save the result as HTML, since rendering is more associated with taking code and outputting pictures or sounds. What you want is closer to saving the temporary modifications of the HTML/CSS/JS state than to rendering anything. Commented Sep 10, 2013 at 13:32

2 Answers

10
  • Selenium: a very complete solution with bindings in many languages
  • Puppeteer: headless Chrome API, usable in Node.js or as a command-line tool
  • HTTrack: command-line tool
  • Apache Nutch & WebMagic: open source Java web crawlers
  • pholcus: "distributed & high concurrency" web crawler written in Go
  • Xvfb: a display server implementing the X11 display server protocol without showing any screen output. I have used it successfully with Travis CI and Protractor, for example. Alternative: Xdummy
  • PhantomJS (first suggested by nvuono): can also export the rendered page in non-HTML formats (PDF, PNG...). PhantomJS development is suspended until further notice. Closely related: SlimerJS, CasperJS
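
As a rough illustration of the Puppeteer option, here is a minimal sketch of the kind of "renderengine" command the question describes. It assumes Node.js with the puppeteer package installed (npm install puppeteer). The file name render.js and the helper names parseArgs and renderToFile are made up for this example, and 'networkidle0' is only one possible heuristic for deciding that the AJAX requests have finished.

```javascript
const fs = require('fs');

// Pull the URL and output path out of an argv-style array.
// Returns null when either argument is missing.
function parseArgs(argv) {
  const [url, outputFile] = argv.slice(2);
  if (!url || !outputFile) {
    return null;
  }
  return { url, outputFile };
}

// Load the page in headless Chrome and save the DOM after scripts have run.
async function renderToFile(url, outputFile) {
  // Required lazily so the helper above loads even without puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // 'networkidle0': navigation counts as finished once there have been
    // no network connections for at least 500 ms (an assumed heuristic here).
    await page.goto(url, { waitUntil: 'networkidle0' });
    const html = await page.content();
    fs.writeFileSync(outputFile, html);
  } finally {
    await browser.close();
  }
}

// Only act when run directly as a script, e.g.:
//   node render.js http://www.google.es outputfile.html
if (typeof module !== 'undefined' && require.main === module) {
  const args = parseArgs(process.argv);
  if (args) {
    renderToFile(args.url, args.outputFile).catch((err) => {
      console.error(err);
      process.exitCode = 1;
    });
  } else {
    console.error('usage: node render.js <url> <outputfile>');
  }
}
```

Sites that keep a connection open permanently may never reach the idle state, so a different wait condition could be needed for them.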

There are also many Python web scraping libraries.


2 Comments

Hi there from 2018. Are there any new tools available? The PhantomJS website is not accessible, and Xvfb's last update was in 2010, so it looks outdated. Thanks!
@J.T. I updated the links to better reflect the options available in 2018. PhantomJS is still used a lot; I think their website is just temporarily down.
6

Try PhantomJS from www.phantomjs.org; you can easily modify the included rasterize.js to export the rendered HTML. It's based on WebKit and fully evaluates your target site's JavaScript, allowing you to adjust timeouts or execute your own code first if you wish. I personally use it to save hard-copy HTML versions of fully rendered knockout.js templates.

It executes JavaScript, so I just did something like this and saved the console output to a file:

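var page = require('webpage').create();
page.open('http://example.com/', function (status) { // placeholder URL; use your target page
    var markup = page.evaluate(function () { return document.documentElement.innerHTML; });
    console.log(markup); // redirect stdout to a file to save the markup
    phantom.exit();
});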
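
A fixed delay before reading the page can be fragile when content arrives via AJAX. One alternative is to poll until some condition holds, roughly as sketched here. The waitFor helper, the '#content' selector, the example URL, and the 5000/100 ms values are all assumptions for illustration and not part of PhantomJS itself. The code sticks to ES5 so that it should run under PhantomJS as well as Node.

```javascript
// Poll `condition` every `intervalMs` until it returns a truthy value
// or `timeoutMs` has elapsed. Calls done(true) on success, done(false) on timeout.
function waitFor(condition, done, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        var ok = false;
        try {
            ok = !!condition();
        } catch (e) {
            ok = false; // treat errors as "not ready yet"
        }
        if (ok || Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            done(ok);
        }
    }, intervalMs);
}

// Usage under PhantomJS; skipped when the `phantom` global does not exist.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) { // placeholder URL
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Hypothetical readiness check: '#content' is a placeholder selector.
            return page.evaluate(function () {
                return document.querySelector('#content') !== null;
            });
        }, function (ready) {
            // Print whatever has been rendered, even if the wait timed out.
            console.log(page.content);
            phantom.exit(ready ? 0 : 1);
        }, 5000, 100);
    });
}
```

The readiness check is the part that has to be adapted per site, since only the page author knows which element signals that loading is complete.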

2 Comments

I tried it, and it works perfectly:
    var page = require('webpage').create();
    page.open('page.es', function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit();
        } else {
            window.setTimeout(function () {
                page.viewportSize = { width: 1000, height: 800 };
                console.log(page.content);
                phantom.exit();
            }, 400);
        }
    });
The timeout is needed because the AJAX content loads after an animation of a certain duration, but it's perfect!
CasperJS is a suspended/dead project
