
I currently have a Node.js-based web scraper that uses the puppeteer module. While it does work, it is very slow, because I have written it to scrape the URLs one at a time (sequentially) rather than concurrently.

The basic logic of the program in pseudocode is as follows:

async function main():

    ......

    while true:
        for url in listOfUrls:
            await scrapeInformation(url)
            if there is a change:
                sendNotification()

The problem with this approach is that I cannot begin scraping another page until the current page has finished. I would like to start loading the next webpages so that they are ready to be scraped when their turn comes in the for loop. However, I still want to limit the number of webpages open for scraping at any one time, so that I do not run into memory errors; I hit that issue in a previous implementation of this script, where I was launching Chromium instances faster than the program could close them.

The scrapeInformation() function looks a bit like this:

async function scrapeInformation(url, browser) {
    // Reuse the shared browser passed in rather than launching a new one here.
    const page = await browser.newPage();
    let response = await page.goto(url);

    let data = await page.evaluate(() => {

        // ... extract the values I need from the page ...

        return {blah, blah};
    });

    await page.close();

    return data;
}

I believe a good place to start would perhaps be to begin scraping another URL around the let data = await page.evaluate(() => { line, but I am unsure how to implement such logic.
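
Something along these lines is roughly what I have in mind, although I have not managed to get it working (an untested sketch; maxOpenPages is a limit I would choose myself, and scrapeInformation() is assumed to take a shared browser as above):

const puppeteer = require('puppeteer');

const maxOpenPages = 3; // my guess at a safe number of simultaneously open pages

async function main() {
    const browser = await puppeteer.launch({headless: true});

    while (true) {
        const queue = [...listOfUrls];

        // maxOpenPages "workers": each awaits a single scrape at a time, so
        // at most maxOpenPages pages are open simultaneously, but a new URL
        // starts loading as soon as any worker becomes free.
        await Promise.all(Array.from({length: maxOpenPages}, async () => {
            while (queue.length) {
                const data = await scrapeInformation(queue.shift(), browser);
                // if there is a change: sendNotification()
            }
        }));
    }
}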

I am using puppeteer because it has a plugin that I need for my monitor to work on a particular site. I also need to be able to execute JS on the website, which cheerio does not do.

1 Answer


If I understand correctly, you need to check a set of URLs in an infinite cycle with limited concurrency. You do not need to open and close browsers for this; it adds unneeded overhead. Just create a pool of n pages (where n is the concurrency limit) and reuse them for one portion of URLs at a time. You can shift a portion off the front of the URL set and push it back onto the end to cycle infinitely. For example:

'use strict';

const puppeteer = require('puppeteer');

const urls = [
  'https://example.org/?test=1',
  'https://example.org/?test=2',
  'https://example.org/?test=3',
  'https://example.org/?test=4',
  'https://example.org/?test=5',
  'https://example.org/?test=6',
  'https://example.org/?test=7',
  'https://example.org/?test=8',
  'https://example.org/?test=9',
  'https://example.org/?test=10',
];
const concurrencyLimit = 3;
const restartAfterNCycles = 5;

(async function main() {
  for (;;) await cycles();
})();

async function cycles() {
  try {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });

    await Promise.all(Array.from(
      Array(concurrencyLimit - 1), // Because one page is already opened.
      () => browser.newPage()
    ));
    const pagePool = await browser.pages();

    let cycleCounter = restartAfterNCycles;
    while (cycleCounter--) {
      const cycleUrls = urls.slice();
      let urlsPart;
      while ((urlsPart = cycleUrls.splice(0, concurrencyLimit)).length) {
        console.log(`\nProcessing concurrently:\n${urlsPart.join('\n')}\n`);
        await Promise.all(urlsPart.map((url, i) => scrape(pagePool[i], url)));
      }
      console.log(`\nCycles to do: ${cycleCounter}`);
    }

    return browser.close();
  } catch (err) {
    console.error(err);
  }
}

async function scrape(page, url) {
  await page.goto(url);
  const data = await page.evaluate(() => document.location.href);
  console.log(`${data} done.`);
}

Comments

I would need to occasionally close the browser and reopen it, because the website seems to deny access when the same browser session has been used for too long. How would I implement this with the above? Are there any online resources I should check out to help me understand this better?
Also, your implementation is much better than what I currently have, but it is still a little different from what I am looking for. In the above implementation, n pages are opened at a time, and once all have finished loading, n more are opened, and so on. Instead, I want a new page to be opened as soon as one has finished loading, while still remaining under n (the concurrency limit).
Sorry, I do not know of any such resources. I've edited the example to restart the browser after some number of cycles.
And, sorry, I cannot understand the second comment. My implementation does not open new pages after the previous ones are loaded; it reuses the same ones.
Sorry, "page" is the wrong word. I mean that your script visits n URLs, waits for all of them to finish, and then moves on to the next n URLs, and so on. I am looking for a script that loads a new URL as soon as a previous URL has been loaded and evaluated, while still ensuring that at most n pages are open at one time, so that it does not run into memory issues from having too many pages open.
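
One untested way to adapt the answer above to that behaviour (pagePool, urls and scrape() are the names used in the answer; scrapeIndependently and nextUrlIndex are introduced here for illustration) is to let each page in the pool pull the next URL from a shared index as soon as it is free, instead of waiting for the whole portion to finish:

async function scrapeIndependently(pagePool, urls) {
  // Shared cursor over the URL list; each page claims the next unvisited index.
  let nextUrlIndex = 0;

  await Promise.all(pagePool.map(async (page) => {
    // Each page keeps taking URLs until the list is exhausted, so a new URL
    // starts as soon as this page is free, and never more than
    // pagePool.length pages are in use at once.
    while (nextUrlIndex < urls.length) {
      await scrape(page, urls[nextUrlIndex++]);
    }
  }));
}

Calling scrapeIndependently(pagePool, urls) in place of the inner while ((urlsPart = ...)) loop should give that behaviour, at the cost of the per-portion logging.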
