I created a Python script that reads URLs from a text file and uses the URLs in a for loop to gather similar information. The URLs are all from the same website. Below is roughly what the Python code looks like.

for url in urls:
    x = scrape(url)
    if has_changed(x):
        notify_me()
    else:
        continue

Unfortunately this scraper does not work on some websites, because those sites block most scrapers. So I am forced to use the Node.js Puppeteer Stealth library, which I am not very familiar with, since the Python Pyppeteer is blocked (along with Selenium, requests, requests-html, etc.).

I am trying to implement the synchronous approach of the Python version in Node.js, but I am struggling to. This is my implementation so far...

const puppeteer = require("puppeteer-extra");

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());

// puppeteer usage as normal
async function sad(url, number) {
  puppeteer.launch({ headless: true }).then(async (browser) => {
    console.log("Running tests..");
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: "test" + number + ".png", fullPage: true });
    await browser.close();
    console.log(`All done, check the screenshot. ✨`);
  });
}

var urls = [
  "https://www.example1.com",
  "https://www.example2.com",
  "https://www.example3.com",
];

function iHateNode() {
  for (i in urls) {
    sad(urls[i], i);
  }
}

iHateNode();
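As written, `iHateNode()` kicks off every `sad()` call without waiting for the previous one, because `sad` never awaits (or returns) the promise from `puppeteer.launch(...)`, so all the browsers launch at once. A `for...of` loop with `await` is the usual way to get Python-style sequential behaviour. Here is a minimal runnable sketch of the pattern, with a hypothetical `fakeScrape` standing in for the real Puppeteer call:

```javascript
// fakeScrape stands in for the real Puppeteer scrape (hypothetical helper).
// Later URLs get shorter delays, so a parallel run would finish them first.
function fakeScrape(url, number, log) {
  return new Promise((resolve) =>
    setTimeout(() => {
      log.push(number); // record the order in which scrapes actually complete
      resolve();
    }, (3 - number) * 20)
  );
}

async function runSequentially(urls) {
  const log = [];
  for (const [i, url] of urls.entries()) {
    await fakeScrape(url, i, log); // wait for each scrape before starting the next
  }
  return log;
}

runSequentially(["a", "b", "c"]).then((log) => console.log(log.join(","))); // prints "0,1,2"
```

Because each iteration awaits, the completion order matches the URL order even though the later (faster) scrapes would otherwise finish first; with the original fire-and-forget loop the output would be "2,1,0".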

I have created a data structure: an array with an object for each URL, like the following.

allUrls = [
    {
        url: "https://example1.org",
        extraInfo: "1234"
    },

    {
        url: "https://example1.org",
        extraInfo: "789"
    },

    {
        url: "https://example1.org",
        extraInfo: "987"
    },
    ...

]

The intention is to loop through each url in the allUrls array and call the Puppeteer scraper. If the information has changed, I want to be notified and then update the relevant entry in allUrls to reflect the change (so I am not constantly notified about it). I am not sure whether the asynchronous nature of Node.js will cause problems if two functions attempt to change the allUrls array at the same time. I am not entirely sure that making it synchronous is the best approach here either, but at least it shouldn't cause errors like the one just mentioned.
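A hedged sketch of that loop, assuming a `scrape` helper that returns the fresh `extraInfo` for a URL (stubbed here; the real one would drive a stealth Puppeteer page as in the code above). Because each entry is awaited in turn, only one scrape touches the array at a time, so there is no concurrent-mutation problem:

```javascript
// Stub scrape: pretend only the second URL has new information.
// (hypothetical helper; the real one would use puppeteer-extra)
async function scrape(url) {
  return url === "https://example2.org" ? "new-value" : null;
}

async function checkAll(entries) {
  const changed = [];
  for (const entry of entries) {
    const fresh = await scrape(entry.url); // sequential: no two writes overlap
    if (fresh !== null && fresh !== entry.extraInfo) {
      changed.push(entry.url); // stand-in for notify_me()
      entry.extraInfo = fresh; // update in place so we are not re-notified
    }
  }
  return changed;
}

const sample = [
  { url: "https://example1.org", extraInfo: "1234" },
  { url: "https://example2.org", extraInfo: "789" },
];

checkAll(sample).then((changed) => console.log(changed.join(","))); // prints "https://example2.org"
```

Running `checkAll` a second time on the same array reports nothing, since `extraInfo` was updated in place on the first pass.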

1 Answer

How about just iterating over the urls inside the async function?

puppeteer.launch({ headless: true }).then(async (browser) => {
  console.log("Running tests..");
  const page = await browser.newPage();
  let number = 0;
  for (const url of urls) {
    await page.goto(url);
    await page.screenshot({ path: "test" + number + ".png", fullPage: true });
    number++;
  }
  console.log(`All done, check the screenshots. ✨`);
  await browser.close();
});