I'm trying to make a simple Tumblr scraper using Node.js:

var request = require('request');
var fs = require('fs');
var apiKey = 'my-key-here';
var offset = 0;

for (var i=0; i<5; i++) {
  console.log('request #' + i + '...');

  var requestURL = 'http://api.tumblr.com/v2/blog/blog.tumblr.com/posts/text?api_key='
    + apiKey
    + '&offset='
    + offset;

  console.log(requestURL);

  request(requestURL, function(error, response, body) {
    if (!error && response.statusCode == 200) {
      var resultAsJSON = JSON.parse(body);
      resultAsJSON.response.posts.forEach(function(obj) {
        fs.appendFile('content.txt', offset + ' ' + obj.title + '\n', function (err) {
          if (err) return console.log(err);
        });   
        offset++;  
      });       
    }
  }); 
}

By default, the API returns at most the 20 latest posts. I want to grab all the posts instead. As a test, I want to get the latest 100 first, hence the i<5 in the loop declaration.

The trick is to use the offset parameter. Given an offset of 20, for example, the API skips the latest 20 posts and returns posts starting from the 21st.

As I can't be sure that the API will always return 20 posts, I am using offset++ to get the correct offset number.

The code above works, but console.log(requestURL) prints http://api.tumblr.com/v2/blog/blog.tumblr.com/posts/text?api_key=my-key-here&offset=0 five times.

So my question is: why does the offset value in my requestURL remain 0, even though I have added offset++?

  • Not this again. You fire off a request and expect it to complete before the loop goes to the next iteration. The requests don't even get started until after the loop completes, which is why offset is zero for all of them. You need an asynchronous for-each loop. Commented Jan 29, 2014 at 8:58
  • The thing is I'm writing the offset variable in appendFile, and they showed up correctly in the text file from 0 to 99. Commented Jan 29, 2014 at 9:01
  • That's just due to the callbacks to the requests occurring in the same sequence they were fired in but that is not guaranteed and you should not depend on it. Commented Jan 29, 2014 at 9:04
  • I understand what you mean. I thought this is some sort of variable scoping gotcha that I'm unaware of, but now I'm not so sure. Commented Jan 29, 2014 at 9:08

1 Answer

You should increment the offset in the loop, not in the callbacks. A callback fires only after its request has completed, so the loop builds all five URLs with offset = 0, and offset is incremented only once the responses arrive.

  var requestURL = 'http://api.tumblr.com/v2/blog/blog.tumblr.com/posts/text?api_key='
    + apiKey
    + '&offset='
    + (offset++); // increment here, before passing URL to request();
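
To see why all five URLs carry offset=0, it may help to swap request() for a stand-in. The sketch below is not part of the original code: fakeRequest is a hypothetical function that, like request(), invokes its callback asynchronously (here via setTimeout), and urlsBuilt and callbacksRun are names introduced only for illustration.

```javascript
// fakeRequest is a hypothetical stand-in for request(): it calls back
// asynchronously, after the current synchronous code has finished.
function fakeRequest(url, callback) {
  setTimeout(function () {
    callback(null, { statusCode: 200 }, '{}');
  }, 0);
}

var offset = 0;
var urlsBuilt = [];
var callbacksRun = 0;

for (var i = 0; i < 5; i++) {
  // The URL fragment is built now, while offset is still 0.
  urlsBuilt.push('offset=' + offset);

  fakeRequest(urlsBuilt[i], function () {
    // This runs only after the whole loop has finished.
    callbacksRun++;
    offset++;
  });
}

// Still in the same synchronous run: no callback has fired yet.
console.log(urlsBuilt);    // five identical 'offset=0' entries
console.log(callbacksRun); // 0
```

The loop never yields to the event loop, so every iteration reads the same, unmodified offset.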

Edit: To advance the offset by 20 in each iteration and still use it inside the callback:

for (var i=0; i<5; i++) {
    var offset = i * 20; // each call returns up to 20 posts
    var requestURL = 'http://api.tumblr.com/v2/blog/blog.tumblr.com/posts/text?api_key='
        + apiKey
        + '&offset='
        + offset;

    (function(off){ 
        request(requestURL, function(error, response, body) {
            if (!error && response.statusCode == 200) {
                var resultAsJSON = JSON.parse(body);
                resultAsJSON.response.posts.forEach(function(obj) {
                    fs.appendFile('content.txt', off + ' ' + obj.title + '\n', function (err) {
                        if (err) return console.log(err);
                    });   
                    off++;  
                });       
            }
        });
    }(offset)); // pass the offset from loop to a closure
}
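
If the number of posts per response cannot be relied on (the concern raised in the question), one alternative is to issue the requests one at a time and start the next one only from the previous callback. This is a minimal sketch, not the only way to do it. It assumes a hypothetical fetchPage(offset, callback) helper that would wrap the request() call; here it is stubbed with a fake 45-post blog so the example runs without an API key. fetchAll and fakeTitles are likewise illustrative names.

```javascript
// fetchPage is a hypothetical stand-in for the request() call to the API.
// Here it serves slices of a fake 45-post blog, at most 20 posts per call.
var fakeTitles = [];
for (var n = 0; n < 45; n++) fakeTitles.push('post ' + n);

function fetchPage(offset, callback) {
  setTimeout(function () {
    callback(null, fakeTitles.slice(offset, offset + 20));
  }, 0);
}

// Fetch pages one after another until maxPages is reached or a page comes
// back empty. The next request starts only inside the previous callback,
// so offset is always up to date when it is used.
function fetchAll(maxPages, done) {
  var collected = [];
  var offset = 0;

  function next(pagesLeft) {
    if (pagesLeft === 0) return done(null, collected);
    fetchPage(offset, function (err, titles) {
      if (err) return done(err);
      if (titles.length === 0) return done(null, collected);
      titles.forEach(function (title) {
        collected.push(offset + ' ' + title);
        offset++; // count the posts actually returned
      });
      next(pagesLeft - 1);
    });
  }

  next(maxPages);
}

fetchAll(5, function (err, lines) {
  if (err) return console.log(err);
  console.log(lines.length + ' lines collected');
});
```

The trade-off is that requests no longer run in parallel, but the offset no longer depends on every response containing exactly 20 posts.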

5 Comments

There's a difference between i and offset, though. offset needs to be the number of posts to be skipped, while i is the number of times I want to use the API. 1 API call gives up to 20 posts at once.
But in your question you increment offset by one in each iteration, so I've assumed offset == i. Use offset = i * 20 then.
offset++ is inside the forEach scope, which iterates all of the posts returned by the API. So I'm counting the posts there.
Tried running it. Got an error: }(offset)); // pass the offset from loop to a closure SyntaxError: Unexpected token }
forgot ); in the line before }(offset));
