0

I recently started getting into web scraping and have managed OK so far, but now I'm stuck and can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page:

import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

#finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]

#finds the right paragraph to grab
reading = soup.find_all('p')[0].text

print (heading, reading)

import csv
from datetime import datetime

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
 writer = csv.writer(csv_file)
 writer.writerow([heading, reading, datetime.now()])

The problem occurs when I try to scrape multiple pages at once. They are all the same; only the pagination changes.

Instead of writing the same code 20 times, how do I put all the data in a tuple or an array and export it to CSV? Many thanks in advance.

1
  • You could append the results to a list, or to three lists (heading, reading, date), after every page, or you could read them into a pandas DataFrame and then write the whole thing to a CSV. I'm not sure which would be better or faster, but for 20 iterations it doesn't matter much. Commented Aug 7, 2020 at 9:43
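
A minimal sketch of the three-list idea from this comment. It assumes a hypothetical `scrape_page(page_number)` helper that returns a `(heading, reading)` pair; the stub below returns placeholder strings instead of making a request, so all names and values here are illustrative only.

```python
import csv
from datetime import datetime


def scrape_page(page_number):
    """Hypothetical stand-in for the requests/BeautifulSoup code in the question."""
    return f"Heading{page_number}", f"Reading for page {page_number}"


def collect(page_numbers):
    """Scrape each page and return three parallel lists."""
    headings, readings, dates = [], [], []
    for page_number in page_numbers:
        heading, reading = scrape_page(page_number)
        headings.append(heading)
        readings.append(reading)
        dates.append(datetime.now())
    return headings, readings, dates


def write_csv(csv_file, headings, readings, dates):
    """Write one row per page by zipping the three lists back together."""
    writer = csv.writer(csv_file)
    writer.writerows(zip(headings, readings, dates))


def main():
    """Collect 20 pages and append them to index.csv in a single pass."""
    headings, readings, dates = collect(range(1, 21))
    # newline="" is what the csv module docs recommend for files given to csv.writer
    with open("index.csv", "a", newline="") as csv_file:
        write_csv(csv_file, headings, readings, dates)
```

Calling `main()` would append 20 rows to `index.csv`. A single list of tuples would work just as well; the three lists only mirror the wording of the comment.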

1 Answer

2

Use a loop that keeps requesting pages until one is no longer available, i.e. until the response status is not 200 OK.

import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

results = []
page_number = 1

while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')
    #finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    #finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # write a list
    # results.append([heading, reading, datetime.now()])
    # or tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1

# open the csv file in append mode; newline='' avoids blank rows on Windows
with open('index.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
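
One caveat with `while True`: some servers answer 200 even for page numbers past the last page, in which case the loop would not end on its own, and `soup.find('h1')` could return `None`. Below is a hedged sketch of a bounded variant. `collect_rows`, `fetch_row`, and `max_pages` are hypothetical names, and the fake fetcher only stands in for the requests/BeautifulSoup code above.

```python
from datetime import datetime


def collect_rows(fetch_row, max_pages=20):
    """Call fetch_row(page_number) for pages 1..max_pages.

    fetch_row is expected to return a (heading, reading) tuple, or None
    when the page is missing or has no usable content. Collection stops
    at the first None or after max_pages, whichever comes first.
    """
    results = []
    for page_number in range(1, max_pages + 1):
        row = fetch_row(page_number)
        if row is None:
            break
        heading, reading = row
        results.append((heading, reading, datetime.now()))
    return results


def fake_fetch_row(page_number):
    """Hypothetical fetcher: pretends only pages 1 to 3 exist."""
    if page_number > 3:
        return None
    return f"Heading{page_number}", f"Reading {page_number}"


rows = collect_rows(fake_fetch_row, max_pages=20)
print(len(rows))  # 3
```

In real use, `fetch_row` would wrap the `requests.get` call and the BeautifulSoup lookups from the answer, returning `None` when the status code is not 200 or when `soup.find('h1')` returns `None`.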