
Complete Node.js noob, so don't judge me...

I have a simple requirement. Crawl a web site, find all the product pages, and save some data from the product pages.

Easier said than done.

Looking at Node.js samples, I can't find anything similar.

There's a request scraper:

var request = require('request');
var jsdom = require('jsdom');

request({uri: 'http://www.google.com'}, function (error, response, body) {
  if (!error && response.statusCode === 200) {
    var window = jsdom.jsdom(body).createWindow();
    // the callback parameter and the name used below must match ('jQuery')
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jQuery) {
      // jQuery is now loaded on the jsdom window created from 'body'
      jQuery('.someClass').each(function () { /* Your custom logic */ });
    });
  }
});

But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape.

Then there's the http agent way:

var httpAgent = require('http-agent');
var jsdom = require('jsdom');

var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
  var window = jsdom.jsdom(agent.body).createWindow();
  jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
    // jQuery is now loaded on the jsdom window created from 'agent.body'
    jquery('.someClass').each(function () { /* Your Custom Logic */ });

    // move on to the next location in the list
    agent.next();
  });
});

agent.addListener('stop', function (agent) {
  console.log('the agent has stopped');
});

agent.start();

This takes an array of locations, but once you start it with that array, you can't add more locations to it to go through all the product pages.

And I can't even get Apricot working; for some reason I'm getting an error.

So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the jquery.someclass example should do the trick), and then save that to a DB?

Thanks!


2 Answers


Personally, I use node.io to scrape some websites: https://github.com/chriso/node.io

More details about scraping can be found in the wiki!



2 Comments

The best answer for me: simple and fast.
Thanks. I was searching Google for a Node crawl add-on and found this answer through this question. Thanks for sharing; this should be the accepted answer. In the past I did it similarly to how the author did it, but this is amazing.

I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct. Callbacks can be chained as foo.then(), which is super simple to understand, and I can even use jQuery since PhantomJS is built on WebKit. For example, the following creates an instance of Casper and pushes all links on an archive page to an array called 'links'.

var casper = require("casper").create();

var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");

    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});


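// Capture links: collect the href of every nav link into 'links'
capture = function() {
    links = this.evaluate(function() {
        // this function runs inside the page, where the injected jQuery lives
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push(jQuery(this).attr('href'));
        });
        return link;
    });
    // selectLink is declared above; its body is not shown in this excerpt
    this.then(selectLink);
};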
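
Once the links are in an array, the remaining problem is visiting them while still being able to add new ones as they turn up. Here is a minimal sketch of that idea in plain JavaScript, with no CasperJS dependency. The names `createFrontier`, `crawl`, `visit`, and `fakeSite` are all hypothetical and only there for illustration; `visit` stands in for whatever actually fetches a page and extracts its links.

```javascript
// Hypothetical helper: a queue of URLs that can keep growing during the crawl.
function createFrontier(startUrls) {
  var queue = [];
  var seen = Object.create(null); // URLs already queued, so nothing is visited twice

  function add(url) {
    if (seen[url]) { return false; }
    seen[url] = true;
    queue.push(url);
    return true;
  }

  startUrls.forEach(function (url) { add(url); });

  return {
    add: add,
    next: function () { return queue.shift(); } // undefined once the queue is empty
  };
}

// visit(url, callback) is assumed to call back with (err, linksFoundOnThatPage).
function crawl(startUrls, visit, done) {
  var frontier = createFrontier(startUrls);
  var visited = [];

  function step() {
    var url = frontier.next();
    if (url === undefined) { return done(null, visited); }
    visit(url, function (err, links) {
      if (!err) {
        visited.push(url);
        (links || []).forEach(function (link) { frontier.add(link); });
      }
      step(); // pages that fail are skipped and the crawl carries on
    });
  }

  step();
}

// Stand-in for a real site: each path maps to the links found on that page.
var fakeSite = {
  '/': ['/products/1', '/products/2', '/about'],
  '/products/1': ['/', '/products/2'],
  '/products/2': ['/products/3'],
  '/products/3': [],
  '/about': ['/']
};

crawl(['/'], function (url, callback) {
  callback(null, fakeSite[url] || []);
}, function (err, visited) {
  console.log(visited.join('\n'));
});
```

The `seen` map is what keeps pages that link back to each other from looping forever. In a real scraper, `visit` would wrap the page fetch, and the product data would be collected inside it.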

You can then use the fs module (PhantomJS ships its own, since CasperJS scripts don't run under Node itself), or whatever else you want, really, to push your data into XML, CSV, or any other format. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
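
For the CSV case, the serialization step could look roughly like this. It's a sketch in plain JavaScript; `csvField`, `toCsv`, and the `products` sample data are hypothetical names and values made up for illustration, and the quoting follows the usual convention of wrapping fields that contain commas, quotes, or line breaks and doubling any embedded quotes.

```javascript
// Hypothetical helper: quote a single value only when CSV requires it.
function csvField(value) {
  var text = (value === null || value === undefined) ? '' : String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Turn an array of objects into CSV text, with one column per name in 'columns'.
function toCsv(rows, columns) {
  var lines = [columns.map(csvField).join(',')]; // header row
  rows.forEach(function (row) {
    lines.push(columns.map(function (name) {
      return csvField(row[name]);
    }).join(','));
  });
  return lines.join('\r\n') + '\r\n';
}

// Made-up sample records standing in for scraped product data.
var products = [
  { name: 'Widget', price: 9.99, url: '/products/1' },
  { name: 'Gadget, large', price: 24.5, url: '/products/2' },
  { name: 'The "Deluxe"', price: 100, url: '/products/3' }
];

console.log(toCsv(products, ['name', 'price', 'url']));
```

The resulting string can then be handed to whichever file API the script runs under, for example `fs.writeFileSync` in Node or `fs.write` in PhantomJS.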

This is a view from 10,000 feet of what casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).

My full scraping example is here: https://gist.github.com/imjared/5201405.

1 Comment

+1 for Casperjs. Your answer led me to try it out and within 3 hours I got a lot done - it's pretty easy to get into.
