
I have a script that collects URLs leading to individual poems. The code currently works and uses multiprocessing pools, but I am being restricted or blocked in some way by the website I am scraping. How should I edit this code to scrape the site without slowing their website down (and how slowly should I go?), or is speed not the problem? The code does work, but only when I go slowly and only run it a few times a day.

The code goes to the site "https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added"

and scrapes it for URLs that lead to individual poems, such as "https://www.poetryfoundation.org/poems/159835/on-naming-yourself-a-cento"

import bs4 as bs
import re

# For simulating the table on the webpage which is dynamically loaded.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver import Chrome 

from multiprocessing import cpu_count
from multiprocessing import Pool
import time

# Module-level configuration; worker processes pick these up when the
# module is imported. (A `global` statement at module level is a no-op.)

options = webdriver.ChromeOptions()
options.add_argument('--headless') # it's more scalable to work in headless mode
# Normally selenium waits for all resources to download; we don't need that,
# as the page is populated by the running JavaScript code.
options.page_load_strategy = 'none'
# This returns the path of the downloaded web driver (only needs doing once).
chrome_path = ChromeDriverManager().install()
chrome_service = Service(chrome_path)
# pass the defined options and service objects to initialize the web driver


def scrape_urls(pg_num: int):
    driver = Chrome(options=options, service=chrome_service)
    driver.implicitly_wait(5)

    link = f"https://www.poetryfoundation.org/poems/browse#page={pg_num}&sort_by=recently_added"

    try:
        driver.get(link) # load the page
        time.sleep(10)   # give the JavaScript time to populate the page
        html_source = driver.page_source
    finally:
        driver.quit()    # always release the browser process

    soup = bs.BeautifulSoup(html_source, features="html.parser")

    out_urls = set()

    for a_href in soup.find_all("a", href=re.compile('.*/poems/[0-9]+/.*')):
        out_urls.add(a_href.get("href"))

    return out_urls


def write(urls: set):
    # read past urls into a set; the file may not exist on the first run
    prev_urls = set()
    try:
        with open('urls.txt', mode='r', encoding='utf-8') as f:
            for line in f:
                prev_urls.add(line.strip())
    except FileNotFoundError:
        pass

    print(f'# of old urls: {len(prev_urls)}')

    # keep only urls not already in the file
    out_urls = urls.difference(prev_urls)
    print(f'# of new urls: {len(out_urls)}')

    # append the new urls to the file
    with open('urls.txt', mode='a', encoding='utf-8') as f:
        for url in out_urls:
            f.write(url + '\n')


def main():
    start = time.time()
    num_urls = 100 # number of browse pages to scrape; the site has 2341
    num_processes = cpu_count() # number of processes

    out_urls = set()
    with Pool(num_processes) as p:
        for result in p.map(scrape_urls, range(1, num_urls+1)):
            out_urls |= result
    
    write(out_urls)
    
    end = time.time()
    print(f"Time spent: {end-start}")


if __name__ == "__main__":
    main()
This is not a code style / design issue and appears to be off-topic. Other StackExchange sites such as stackoverflow.com would be a better venue for asking about headers, security, ToS, timing schedules, IPv4 proxy networks, and related items. – Commented Mar 13, 2023 at 20:03

1 Answer


One thing that immediately jumps out at me is that you set num_processes = cpu_count() for the multiprocessing pool. The absolute maximum I would recommend is cpu_count() - 1, and even that is high. I do similar web-scraping tasks with multiprocessing: I have 16 logical processors and set the Pool to 10. CPU sits at roughly 80% utilisation, which avoids the context switching that can make multiprocessing slower than using fewer processes. This point is more of an FYI for future reference, though; I actually don't think you need multiprocessing at all for this task...
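As a rough sketch of that sizing rule (the 10-of-16 figure above is just my own machine; the right cap depends on how CPU-bound the per-page work is):

```python
from multiprocessing import cpu_count

# Leave headroom below the logical core count; max() guards the
# single-core case where cpu_count() - 1 would be zero.
num_processes = max(1, cpu_count() - 1)
```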

Instead, I strongly advise against using Selenium except for very specific use cases. Inspecting the webpage and recording the network activity, we can see that there is a GET request which returns a JSON response containing all the links you're looking for. See here: https://www.poetryfoundation.org/ajax/poems?page=1&sort_by=recently_added

Fetching the JSON from this link and looping through the pages, you should be able to parse the links you want considerably faster.
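A minimal sketch of that approach using only the standard library. The endpoint comes from the network tab above, but its JSON schema isn't documented here, so rather than assume key names the parser scans every string value for the /poems/<id>/ pattern; the User-Agent header and the 2-second delay are my own assumptions, not known requirements of the site:

```python
import json
import re
import time
from urllib.request import Request, urlopen

POEM_URL = re.compile(r"/poems/[0-9]+/")

def extract_poem_urls(payload) -> set:
    """Walk an arbitrary JSON payload and collect any string that
    looks like a poem URL, so we don't depend on exact key names."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and POEM_URL.search(node):
            found.add(node)

    walk(payload)
    return found

def scrape_page(pg_num: int) -> set:
    # Endpoint observed in the browser's network tab; schema unverified.
    url = f"https://www.poetryfoundation.org/ajax/poems?page={pg_num}&sort_by=recently_added"
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return extract_poem_urls(payload)

def scrape_all(last_page: int, delay: float = 2.0) -> set:
    """Sequential fetch with a fixed pause between requests; one polite
    worker is usually enough once the browser overhead is gone."""
    urls = set()
    for pg in range(1, last_page + 1):
        urls |= scrape_page(pg)
        time.sleep(delay)  # stay well under the site's rate limit
    return urls
```

Even fetched one page at a time with a delay, this puts far less load on the site than launching a headless browser per page, and it avoids the blocking behaviour the question describes.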

