I currently have a Node.js-based web scraper that uses the puppeteer module. While it does work, it is very slow, since I have written it in such a way that it uses a synchronous approach instead of an asynchronous one.
The basic logic of the program in pseudo code is as follows:
async function main():
    ......
    while true:
        for url in listOfUrls:
            await scrapeInformation()
            if there is a change:
                sendNotification()
The problem with this approach is that I cannot begin scraping another page until the current page has been scraped. I would like to begin loading the next webpages so that they are ready to be scraped when their turn comes in the for loop. However, I still want to limit the number of webpages open for scraping at any one time, so that I do not run into memory errors; I hit that issue in a previous implementation of this script, where I was launching instances of the Chromium browser much faster than the program was able to close them.
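To make it concrete, I think something like a fixed-size pool of workers is roughly what I am after. Here is a rough sketch of the idea (scrapeAll, worker, and CONCURRENCY are just names/values I made up for illustration, and it reuses the scrapeInformation(url, browser) shown below):

const CONCURRENCY = 3; // hypothetical cap on simultaneously open pages

async function scrapeAll(listOfUrls, browser) {
    const queue = [...listOfUrls];
    // Each worker repeatedly pulls the next URL off the shared queue,
    // so at most CONCURRENCY pages are ever open at once.
    const worker = async () => {
        while (queue.length > 0) {
            const url = queue.shift();
            const data = await scrapeInformation(url, browser);
            // compare data against the previous run and
            // sendNotification() if something changed
        }
    };
    await Promise.all(Array.from({ length: CONCURRENCY }, worker));
}

I am not sure if this is the right pattern, though, so any guidance is appreciated.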
The scrapeInformation() function looks a bit like this:
async function scrapeInformation(url, browser) {
    // The browser instance is passed in as a parameter rather than
    // launched here, so only one Chromium process is kept alive.
    const page = await browser.newPage();
    let response = await page.goto(url);
    let data = await page.evaluate(() => {
        // blah blah blah
        return {blah, blah};
    });
    await page.close();
    return data;
}
I believe a good place to start would be to begin loading another URL around the let data = await page.evaluate(() => { line, but I am unsure how to implement such logic.
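For what it's worth, here is a rough sketch of the prefetching I am picturing, where the next page starts loading while the current one is being evaluated (scrapeWithPrefetch, loadPage, and nextPagePromise are just names I made up, and I have no idea whether this is the idiomatic way to structure it):

async function scrapeWithPrefetch(listOfUrls, browser) {
    if (listOfUrls.length === 0) return;
    // Kick off the first page load up front.
    let nextPagePromise = loadPage(listOfUrls[0], browser);
    for (let i = 0; i < listOfUrls.length; i++) {
        const page = await nextPagePromise;
        // Start loading the following page while we evaluate this one.
        if (i + 1 < listOfUrls.length) {
            nextPagePromise = loadPage(listOfUrls[i + 1], browser);
        }
        const data = await page.evaluate(() => {
            // extract whatever is needed here
        });
        await page.close();
    }
}

async function loadPage(url, browser) {
    const page = await browser.newPage();
    await page.goto(url);
    return page;
}

Is something along these lines the right direction?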