
I started learning Python today and so it is not a surprise that I am struggling with some basics. I am trying to parse data from a school website for a project and I managed to parse the first page. However, there are multiple pages (results are paginated).

I have an idea about how to go about it, i.e., run through the URLs in a loop since I know the URL format, but I have no idea how to proceed. I figured it would be better to somehow search for the "next" button and run the function if it is there; if not, stop the function.

I would appreciate any help I can get.

import requests
from bs4 import BeautifulSoup

url = "http://www.myschoolwebsite.com/1"
#url2 = "http://www.myschoolwebsite.com/2"
r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find_all('ul', {"class": "searchResults"})

for item in g_data:
    for li in item.findAll('li'):
        for resultnameh2 in li.findAll('h2'):
            for resultname in resultnameh2.findAll('a'):
                print(resultname.text)
        for resultAddress in li.findAll('p', {"class": "resultAddress"}):
            print(resultAddress.text.replace('Get directions', '').strip())
        for contactList in li.findAll('ul', {"class": "resultContact"}):
            for resultContact in contactList.findAll('a', {"class": "resultMainNumber"}):
                print(resultContact.text)

2 Answers


First, you can assume a maximum number of pages in the directory (if you know the pattern of the URL). I am assuming the URL is of the form http://base_url/page. Next, you can write this:

base_url = 'http://www.myschoolwebsite.com'
total_pages = 100

def parse_content(r):
    soup = BeautifulSoup(r.content, 'lxml')
    g_data = soup.find_all('ul', {"class": "searchResults"})

    for item in g_data:
        for li in item.findAll('li'):
            # school name
            for resultnameh2 in li.findAll('h2'):
                for resultname in resultnameh2.findAll('a'):
                    print(resultname.text)
            # address
            for resultAddress in li.findAll('p', {"class": "resultAddress"}):
                print(resultAddress.text.replace('Get directions', '').strip())
            # contact number
            for contactList in li.findAll('ul', {"class": "resultContact"}):
                for resultContact in contactList.findAll('a', {"class": "resultMainNumber"}):
                    print(resultContact.text)

for page in range(1, total_pages + 1):
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break

    parse_content(response)
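
If guessing a page count feels fragile, the "next" button idea from the question also works: keep following the pager link until it disappears. A minimal sketch that reuses parse_content from above; the anchor with class "next" is an assumed selector, so check the site's actual pager markup:

from urllib.parse import urljoin

url = base_url + '/1'  # start at the first page

while url:
    response = requests.get(url)
    parse_content(response)

    soup = BeautifulSoup(response.content, 'lxml')
    # "next" is an assumed class name -- inspect the real pager HTML.
    next_link = soup.find('a', {"class": "next"})
    # Follow the link if present (urljoin handles relative hrefs);
    # otherwise stop the loop.
    url = urljoin(url, next_link['href']) if next_link else None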

6 Comments

Thanks! I ran the code but for some reason it is now only outputting resultname.text and omitting all the other fields. Why would it do this?
That shouldn't happen. Print the URL of the page being hit so that you know which one is giving only resultname.text. Open that URL in the browser to see if that is the case. Maybe the HTML format they use is different, or there is no data.
It works fine when I use the code I posted, though. I am not sure what the issue could be.
Is this happening for all the pages?
Probably some issue with the site or something; I'll try to figure it out. Thanks for all your help, much appreciated :)

I would make an array with all the URLs and loop through it, or if there is a clear pattern, write a regex to search for that pattern.
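
A minimal sketch of both variants, assuming the same http://www.myschoolwebsite.com/<page> pattern from the question (the page limit of 100 and the regex are illustrative assumptions), with parse_content from the first answer:

import re
import requests

# Variant 1: build the list of URLs up front and loop over it.
urls = ['http://www.myschoolwebsite.com/' + str(page) for page in range(1, 101)]
for url in urls:
    response = requests.get(url)
    if response.status_code != 200:
        break  # stop at the first missing page
    parse_content(response)

# Variant 2: pull page links matching a pattern out of the first page's HTML.
html = requests.get('http://www.myschoolwebsite.com/1').text
page_urls = re.findall(r'http://www\.myschoolwebsite\.com/\d+', html)
for url in dict.fromkeys(page_urls):  # dedupe while preserving order
    parse_content(requests.get(url))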

Comments
