
I am using the following code to scrape a website. The code works fine for a single page on the site. Now I want to scrape several such pages, for which I am looping over the id in the URL as shown below.

from bs4 import BeautifulSoup
import urllib2
import csv
import re
number = 2500
for i in xrange(2500,7000):
    page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

The following is the original code, without the loop:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id=4591")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
    print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8'))

I am looping the id value in the URL from 2500 to 7000, but there are many ids that have no record behind them, so those pages do not exist. How do I skip those ids and scrape data only when a page exists for the given id?

2 Answers


You can either wrap the call in try/except, i.e. ask forgiveness rather than permission (https://stackoverflow.com/questions/6092992/why-is-it-easier-to-ask-forgiveness-than-permission-in-python-but-not-in-java):

for i in xrange(2500,7000):
    try:
        page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    except urllib2.HTTPError:
        # the server returned an error (e.g. 404) for this id -- skip it
        continue
    else:
        soup = BeautifulSoup(page.read())
        for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
            print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
            print '\n'
            number = number + 1
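The try/except/else flow above can be demonstrated without touching the network; `fetch` below is a made-up stand-in for `urllib2.urlopen` that raises for ids with no page, which is exactly how the real loop skips missing records:

```python
# Demonstrates the try/except/else skip pattern with a stand-in fetcher.
def fetch(i):
    # Stand-in for urllib2.urlopen: pretend even ids have no page.
    if i % 2 == 0:
        raise IOError("HTTP Error 404: Not Found")
    return "<fieldset id='ctl00_step2'>record %d</fieldset>" % i

found = []
for i in range(2500, 2510):
    try:
        page = fetch(i)
    except IOError:
        continue  # no page for this id, move on to the next one
    else:
        found.append(page)

print(len(found))  # only the ids that actually resolved (here, the 5 odd ones)
```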

Or use a (great) library such as requests and check the status code before scraping:

import requests
for i in xrange(2500,7000):
    page = requests.get("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    if not page.ok:
        continue
    soup = BeautifulSoup(page.text)
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

Basically, there is no way to know whether the page with a given id exists before requesting the URL.
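If you want to avoid downloading full bodies for missing ids, one cheaper option is a HEAD request via requests. This is only a sketch under an assumption: the server must answer HEAD correctly, which not every ASP.NET app does (some only answer GET, in which case fall back to `session.get`). The `session` parameter lets you reuse a `requests.Session` or substitute a stub:

```python
def page_exists(url, session=None):
    """Return True if the URL answers a HEAD request with status 200.

    A HEAD request fetches only the status line and headers, not the
    body, so it is cheap for probing many ids.
    """
    if session is None:
        import requests  # assumed available: pip install requests
        session = requests
    resp = session.head(url, allow_redirects=True, timeout=10)
    return resp.status_code == 200
```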




Try to find an index page on the site; otherwise, you simply can't tell whether a page exists before trying to reach its URL.

6 Comments

What does that have to do with this? I have a list of URLs, and I want to skip a URL if it does not exist. Sorry, but I don't get exactly what you mean.
Most websites have some way of looping (paging) over existing records (ids in your case), or some other way of reaching or searching them; otherwise those pages would not be accessible to their users. Most spiders/fetchers use those "meta" pages to cover the entire set: the first step runs over the index page, and the next step scrapes the pages it points to. Check out projects like scrapy.org, maybe even use one :) Sorry if I'm not following your intention...
Yes, I understand, but I don't think it is the same situation here, because I am able to access any particular URL for a given id, I guess.
I know you can :) I'm just saying you should not run a blind loop over all ids; users of the site would not browse like that. Let your spider use the site as users would: have it browse pages the way a potential user is expected to, investigate the site structure, and look for a pagination or browse page.
SO is funny sometimes; the above answer suggests hitting a few thousand 404s on the site. IMO this is bad for at least 10 different reasons.
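The crawl-the-index approach described in these comments can be sketched as follows. The HTML string is a hypothetical stand-in for whatever listing page the site actually offers (whether one exists must be checked by hand); the spider would then visit only the ids found there instead of blindly iterating 2500–7000:

```python
from bs4 import BeautifulSoup
import re

# Made-up stand-in for an index/listing page on the site;
# the real markup will differ and must be inspected first.
index_html = """
<ul>
  <li><a href="default.aspx?id=2513">Trainer A</a></li>
  <li><a href="default.aspx?id=2600">Trainer B</a></li>
  <li><a href="default.aspx?id=4591">Trainer C</a></li>
</ul>
"""

soup = BeautifulSoup(index_html, "html.parser")
# Collect only the ids the site actually links to.
ids = [re.search(r'id=(\d+)', a['href']).group(1)
       for a in soup.find_all('a', href=re.compile(r'id=\d+'))]
print(ids)
```

Each extracted id can then be fetched directly, with no wasted requests on ids that were never assigned.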
