
I started learning Python today and so it is not a surprise that I am struggling with some basics. I am trying to parse data from a school website for a project and I managed to parse the first page. However, there are multiple pages (results are paginated).

I have an idea about how to go about it, i.e., run through the URLs in a loop since I know the URL format, but I have no idea how to proceed. I figured it would be better to somehow search for the "next" button and run the function if it is there; if not, stop the function.

I would appreciate any help I can get.

import requests
from bs4 import BeautifulSoup

url = "http://www.myschoolwebsite.com/1"
#url2 = "http://www.myschoolwebsite.com/2"
r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find_all('ul', {"class": "searchResults"})

for item in g_data:
    for li in item.findAll('li'):
        for resultnameh2 in li.findAll('h2'):
            for resultname in resultnameh2.findAll('a'):
                print(resultname.text)
        for resultAddress in li.findAll('p', {"class": "resultAddress"}):
            print(resultAddress.text.replace('Get directions', '').strip())
        for contactList in li.findAll('ul', {"class": "resultContact"}):
            for resultContact in contactList.findAll('a', {"class": "resultMainNumber"}):
                print(resultContact.text)

2 Answers


First, you can assume a maximum number of pages in the directory (if you know the pattern of the URL). I am assuming the URL is of the form http://base_url/page. Next, you can write this:

base_url = 'http://www.myschoolwebsite.com'
total_pages = 100

def parse_content(r):
    soup = BeautifulSoup(r.content, 'lxml')
    g_data = soup.find_all('ul', {"class": "searchResults"})

    for item in g_data:
        for li in item.findAll('li'):
            # school name
            for resultnameh2 in li.findAll('h2'):
                for resultname in resultnameh2.findAll('a'):
                    print(resultname.text)
            # address
            for resultAddress in li.findAll('p', {"class": "resultAddress"}):
                print(resultAddress.text.replace('Get directions', '').strip())
            # contact number
            for contactList in li.findAll('ul', {"class": "resultContact"}):
                for resultContact in contactList.findAll('a', {"class": "resultMainNumber"}):
                    print(resultContact.text)

for page in range(1, total_pages + 1):
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break

    parse_content(response)
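
If guessing a page count feels fragile, the "next" button idea from the question also works: keep following the pager link until it disappears. A minimal sketch that reuses parse_content from above; the anchor with class "next" is an assumed selector, so check the site's actual pager markup:

from urllib.parse import urljoin

url = base_url + '/1'  # start at the first page

while url:
    response = requests.get(url)
    parse_content(response)

    soup = BeautifulSoup(response.content, 'lxml')
    # "next" is an assumed class name -- inspect the real pager HTML.
    next_link = soup.find('a', {"class": "next"})
    # Follow the link if present (urljoin handles relative hrefs);
    # otherwise stop the loop.
    url = urljoin(url, next_link['href']) if next_link else None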

6 Comments

Thanks! I ran the code but for some reason it is now only outputting resultname.text and omitting all the other fields. Why would it do this?
That shouldn't happen. Print the URL of the page being hit so that you know which one is giving only resultname.text. Open that URL in the browser to see if that is the case. Maybe the HTML format they use is different, or there is no data.
It works fine when I use the code I posted, though. I am not sure what the issue could be.
Is this happening for all the pages?
Probably some issue with the site or something; I'll try to figure it out. Thanks for all your help, much appreciated :)

I would make an array with all the URLs and loop through it, or if there is a clear pattern, write a regex to search for that pattern.
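
A minimal sketch of both variants, assuming the same http://www.myschoolwebsite.com/<page> pattern from the question (the page limit of 100 and the regex are illustrative assumptions), with parse_content from the first answer:

import re
import requests

# Variant 1: build the list of URLs up front and loop over it.
urls = ['http://www.myschoolwebsite.com/' + str(page) for page in range(1, 101)]
for url in urls:
    response = requests.get(url)
    if response.status_code != 200:
        break  # stop at the first missing page
    parse_content(response)

# Variant 2: pull page links matching a pattern out of the first page's HTML.
html = requests.get('http://www.myschoolwebsite.com/1').text
page_urls = re.findall(r'http://www\.myschoolwebsite\.com/\d+', html)
for url in dict.fromkeys(page_urls):  # dedupe while preserving order
    parse_content(requests.get(url))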

Comments
