I created a Python script that reads URLs from a text file and uses the URLs in a for loop to gather similar information. The URLs are all from the same website. Below is roughly what the Python code looks like.
for url in urls:
    x = scrape(url)
    if has_changed(x):
        notify_me()
    else:
        continue
Unfortunately, this scraper does not work on some websites because they block most scrapers. Python's Pyppeteer is blocked too (along with Selenium, requests, requests-html, etc.), so I am forced to use the Node.js Puppeteer Stealth library (puppeteer-extra-plugin-stealth), which I am not very familiar with.
I am trying to implement the synchronous approach of the Python script in Node.js, but I am struggling. This is my implementation so far:
const puppeteer = require("puppeteer-extra");
// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const { url } = require("inspector");
puppeteer.use(StealthPlugin());
// puppeteer usage as normal
async function sad(url, number) {
  puppeteer.launch({ headless: true }).then(async (browser) => {
    console.log("Running tests..");
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: "test" + number + ".png", fullPage: true });
    await browser.close();
    console.log(`All done, check the screenshot. ✨`);
  });
}
var urls = [
  "https://www.example1.com",
  "https://www.example2.com",
  "https://www.example3.com",
];
function iHateNode() {
  for (i in urls) {
    sad(urls[i], i);
  }
}
iHateNode();
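For comparison, here is a minimal sketch of how the loop could wait for each page before moving on to the next one. It is not a definitive fix, only an illustration of the pattern: in the code above, sad() does not return or await the promise from puppeteer.launch(), so the for loop starts all the browsers at once. The function names screenshotPage and screenshotAll are made up for this example, and the sketch assumes a single browser that has already been launched and is passed in.

```javascript
// Takes one screenshot using an already-launched browser (assumption:
// the caller created it with puppeteer.launch()).
// Unlike sad(), every step is awaited, so the returned promise
// only resolves once the page has been captured and closed.
async function screenshotPage(browser, url, number) {
  const page = await browser.newPage();
  try {
    await page.goto(url);
    await page.screenshot({ path: "test" + number + ".png", fullPage: true });
  } finally {
    await page.close(); // release the tab even if goto() or screenshot() throws
  }
}

// Visits the URLs strictly one after another, in array order.
// `await` pauses this loop only; it does not block the Node.js process.
async function screenshotAll(browser, urls) {
  for (const [number, url] of urls.entries()) {
    await screenshotPage(browser, url, number);
    console.log(`Done with ${url}`);
  }
}

// Hypothetical entry point, using the `puppeteer` object configured above:
// async function main() {
//   const browser = await puppeteer.launch({ headless: true });
//   try {
//     await screenshotAll(browser, urls);
//   } finally {
//     await browser.close();
//   }
// }
```

Reusing one browser for all URLs is a design choice of this sketch; launching a fresh browser per URL, as sad() does, would also work with the same await pattern, just more slowly.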
I have also created a data structure: an array with one object per URL, roughly like the following.
allUrls = [
  {
    url: "https://example1.org",
    extraInfo: "1234",
  },
  {
    url: "https://example1.org",
    extraInfo: "789",
  },
  {
    url: "https://example1.org",
    extraInfo: "987",
  },
  ...
]
The intention is to loop through each URL in the allUrls array and call the Puppeteer scraper for it. If the information has changed, I want to be notified and then update the relevant entry in allUrls to reflect the change (so that I am not constantly notified about it). I am not sure whether the asynchronous nature of Node.js will cause problems if two functions attempt to change the allUrls array at the same time. I am also not entirely sure that making it synchronous is the best approach here, but at least it shouldn't cause errors like the one just mentioned.
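On the concern about simultaneous changes: JavaScript in Node.js runs on a single thread (unless worker threads are used), so two functions cannot modify allUrls at literally the same moment; their steps can only interleave at await points. The check-and-update loop described above could look roughly like the sketch below. The names checkAll, scrape and notify are hypothetical, and the sketch assumes that scrape(url) resolves to a value that can be compared directly with extraInfo.

```javascript
// Assumptions (not from the original code):
//   scrape(url)             -> Promise resolving to the latest value for that URL
//   notify(entry, newValue) -> called when the value differs from entry.extraInfo
async function checkAll(allUrls, scrape, notify) {
  for (const entry of allUrls) {
    // Entries are handled one at a time, so only this loop touches allUrls.
    const latest = await scrape(entry.url);
    if (latest !== entry.extraInfo) {
      await notify(entry, latest);
      // Remember the new value so the same change is not reported again.
      entry.extraInfo = latest;
    }
  }
}
```

Because each entry is awaited before the next one starts, the update to extraInfo is always finished before the loop moves on. Running the checks concurrently would be faster, but each task would then have to write only to its own entry.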