How to scrape multiple pages with requests in python

Question

I recently started getting into web scraping and have managed OK so far, but now I'm stuck and can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page:

import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

#finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]

#finds the right paragraph to grab
reading = soup.find_all('p')[0].text

print(heading, reading)

import csv
from datetime import datetime

# open the CSV file in append mode so old data will not be erased;
# newline='' stops the csv module from writing blank rows on Windows
with open('index.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])

The problem occurs when I try to scrape multiple pages at the same time. They are all the same; only the pagination changes.

Instead of writing the same code 20 times, how do I collect all the data in a tuple or an array and export it to CSV? Many thanks in advance.

You could append the results to a list, or to three lists (heading, reading, date), after every page, or you could read them into a pandas DataFrame and then write the whole thing to a CSV. I'm not sure which would be better or faster, but for 20 iterations it doesn't matter much.
– Beek, Commented Aug 7, 2020 at 9:43
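
For the DataFrame route mentioned in this comment, a minimal sketch might look like the following. It assumes pandas is installed; the `rows` data is placeholder content standing in for whatever was scraped, and the column names and the `rows_to_csv` helper are made up for illustration.

```python
from datetime import datetime

import pandas as pd

# Placeholder rows standing in for the (heading, reading, timestamp)
# collected from each page; real values would come from the scraper.
rows = [
    ("Heading1", "First paragraph of page 1", datetime(2020, 8, 7, 9, 0)),
    ("Heading2", "First paragraph of page 2", datetime(2020, 8, 7, 9, 1)),
]


def rows_to_csv(rows, path="index.csv"):
    """Write all collected rows to `path` in one call and return the DataFrame."""
    # Column names are hypothetical; pick whatever suits the data.
    df = pd.DataFrame(rows, columns=["heading", "reading", "scraped_at"])
    # mode="a" appends like the original script does; header=False avoids
    # repeating the column names every time the script runs.
    df.to_csv(path, mode="a", header=False, index=False)
    return df
```

Calling `rows_to_csv(rows)` once after the loop replaces the per-row `writer.writerow` calls. For 20 pages the difference from the plain `csv` module is mostly a matter of taste.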

tgdraugr · Accepted Answer · 2020-08-07 09:51:13Z

Just try it with a loop that keeps requesting pages until none is available (the request does not return OK). It should be straightforward.

import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

results = []
page_number = 1

while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')
    #finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    #finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # write a list
    # results.append([heading, reading, datetime.now()])
    # or tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1

# open the CSV file in append mode so old data will not be erased
with open('index.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
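
One caveat with the loop above: some sites answer 200 even for pages that do not exist, in which case `while True` never ends, or `soup.find('h1')` returns `None` and `.text` raises. A slightly more defensive sketch of the same idea is below. The `max_pages` cap, the 10-second timeout and the helper names (`parse_page`, `scrape_pages`, `write_rows`) are assumptions for illustration; the URL is the placeholder from the question.

```python
import csv
from datetime import datetime

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example.com/page.aspx"  # placeholder URL from the question


def parse_page(html):
    """Return (heading, reading), or None if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    p = soup.find("p")
    if h1 is None or p is None:
        return None
    words = h1.get_text().split()
    if not words:
        return None
    # First word of the heading and the text of the first paragraph
    return words[0], p.get_text()


def scrape_pages(session, max_pages=20):
    """Fetch pages 1..max_pages, stopping early at the first bad page."""
    results = []
    for page_number in range(1, max_pages + 1):
        response = session.get(BASE_URL, params={"sign": page_number}, timeout=10)
        if response.status_code != 200:
            break
        parsed = parse_page(response.content)
        if parsed is None:
            break
        results.append((*parsed, datetime.now()))
    return results


def write_rows(rows, path="index.csv"):
    """Append all rows to the CSV file in one call."""
    with open(path, "a", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file).writerows(rows)


# Example usage (performs real HTTP requests):
#     with requests.Session() as session:
#         write_rows(scrape_pages(session))
```

Passing the session in as an argument reuses one connection across requests, and it also makes the loop easy to try out with a stand-in object instead of hitting the real site.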
