
I have a script that collects URLs leading to individual poems. The code currently works and uses multiprocessing pools, but I am being restricted or blocked in some way by the website I am scraping. How should I edit this code to scrape the site without slowing their website down (and how slowly should I go?), or is speed not the problem? The code does work, but only when I go slowly and only run it a few times a day.

The code goes to the site "https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added"

and scrapes it for URLs that lead to individual poems, such as "https://www.poetryfoundation.org/poems/159835/on-naming-yourself-a-cento"

import bs4 as bs
import re

# For simulating the table on the webpage which is dynamically loaded.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver import Chrome 

from multiprocessing import cpu_count
from multiprocessing import Pool
import time

# Module-level configuration; worker processes pick these up when the
# module is imported. (A `global` statement at module level is a no-op.)

options = webdriver.ChromeOptions()
options.add_argument('--headless') # it's more scalable to work in headless mode
# Normally selenium waits for all resources to download; we don't need that,
# as the page is populated by the running JavaScript code.
options.page_load_strategy = 'none'
# This returns the path of the downloaded web driver (only needs doing once).
chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
# pass the defined options and service objects to initialize the web driver


def scrape_urls(pg_num: int):
    driver = Chrome(options=options, service=chrome_service)
    driver.implicitly_wait(5)

    link = f"https://www.poetryfoundation.org/poems/browse#page={pg_num}&sort_by=recently_added"

    try:
        driver.get(link) # load the page
        time.sleep(10)   # give the JavaScript time to populate the page
        html_source = driver.page_source
    finally:
        driver.quit()    # always release the browser process

    soup = bs.BeautifulSoup(html_source, features="html.parser")

    out_urls = set()

    for a_href in soup.find_all("a", href=re.compile('.*/poems/[0-9]+/.*')):
        out_urls.add(a_href.get("href"))

    return out_urls


def write(urls: set):
    # read past urls into a set; the file may not exist on the first run
    prev_urls = set()
    try:
        with open('urls.txt', mode='r', encoding='utf-8') as f:
            for line in f:
                prev_urls.add(line.strip())
    except FileNotFoundError:
        pass

    print(f'# of old urls: {len(prev_urls)}')

    # keep only urls not already in the file
    out_urls = urls.difference(prev_urls)
    print(f'# of new urls: {len(out_urls)}')

    # append the new urls to the file
    with open('urls.txt', mode='a', encoding='utf-8') as f:
        for url in out_urls:
            f.write(url + '\n')


def main():
    start = time.time()
    num_urls = 100 # number of browse pages to scrape; the site has 2341
    num_processes = cpu_count() # number of processes

    out_urls = set()
    with Pool(num_processes) as p:
        for result in p.map(scrape_urls, range(1, num_urls+1)):
            out_urls |= result
    
    write(out_urls)
    
    end = time.time()
    print(f"Time spent: {end-start}")


if __name__ == "__main__":
    main()
This is not a code style / design issue and appears to be off-topic. Other StackExchange sites such as stackoverflow.com would be a better venue for asking about headers, security, ToS, timing schedules, IPv4 proxy networks, and related items. – Commented Mar 13, 2023 at 20:03

1 Answer


One thing that immediately jumps out at me is that you set num_processes = cpu_count() for the multiprocessing pool. The absolute maximum I would recommend is cpu_count() - 1, and even that is high. I do similar web-scraping tasks with multiprocessing: I have 16 logical processors and set the Pool to 10. CPU sits at roughly 80% utilisation, which avoids the context switching that can make multiprocessing slower than using fewer processes. This point is more of an FYI for future reference, though; I actually don't think you need multiprocessing at all for this task...
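As a rough sketch of that sizing rule (the 10-of-16 figure above is just my own machine; the right cap depends on how CPU-bound the per-page work is):

```python
from multiprocessing import cpu_count

# Leave headroom below the logical core count; max() guards the
# single-core case where cpu_count() - 1 would be zero.
num_processes = max(1, cpu_count() - 1)
```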

Instead, I strongly advise against using Selenium except for very specific use cases. Inspecting the webpage and recording the network activity, we can see that there is a GET request which returns a JSON response containing all the links you're looking for. See here: https://www.poetryfoundation.org/ajax/poems?page=1&sort_by=recently_added

Fetching the JSON from this link and looping through the pages, you should be able to parse the links you want considerably faster.
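A minimal sketch of that approach using only the standard library. The endpoint comes from the network tab above, but its JSON schema isn't documented here, so rather than assume key names the parser scans every string value for the /poems/<id>/ pattern; the User-Agent header and the 2-second delay are my own assumptions, not known requirements of the site:

```python
import json
import re
import time
from urllib.request import Request, urlopen

POEM_URL = re.compile(r"/poems/[0-9]+/")

def extract_poem_urls(payload) -> set:
    """Walk an arbitrary JSON payload and collect any string that
    looks like a poem URL, so we don't depend on exact key names."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and POEM_URL.search(node):
            found.add(node)

    walk(payload)
    return found

def scrape_page(pg_num: int) -> set:
    # Endpoint observed in the browser's network tab; schema unverified.
    url = f"https://www.poetryfoundation.org/ajax/poems?page={pg_num}&sort_by=recently_added"
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return extract_poem_urls(payload)

def scrape_all(last_page: int, delay: float = 2.0) -> set:
    """Sequential fetch with a fixed pause between requests; one polite
    worker is usually enough once the browser overhead is gone."""
    urls = set()
    for pg in range(1, last_page + 1):
        urls |= scrape_page(pg)
        time.sleep(delay)  # stay well under the site's rate limit
    return urls
```

Even fetched one page at a time with a delay, this puts far less load on the site than launching a headless browser per page, and it avoids the blocking behaviour the question describes.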

