
I am using the following code to scrape a website. The code works fine for a single page on the site. Now I want to scrape several such pages, for which I am looping over the id in the URL as shown below.

from bs4 import BeautifulSoup
import urllib2
import csv
import re
number = 2500
for i in xrange(2500,7000):
    page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

The following is the original code, without the loop:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id=4591")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
    print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8'))

I am looping the id value in the URL from 2500 to 7000, but there are many ids that have no record behind them, so those pages do not exist. How do I skip those ids and scrape data only when a page exists for the given id?

2 Answers


You can either wrap the call in try/except, i.e. ask forgiveness rather than permission (https://stackoverflow.com/questions/6092992/why-is-it-easier-to-ask-forgiveness-than-permission-in-python-but-not-in-java):

for i in xrange(2500,7000):
    try:
        page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    except urllib2.HTTPError:
        # the server returned an error (e.g. 404) for this id -- skip it
        continue
    else:
        soup = BeautifulSoup(page.read())
        for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
            print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
            print '\n'
            number = number + 1
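The try/except/else flow above can be demonstrated without touching the network; `fetch` below is a made-up stand-in for `urllib2.urlopen` that raises for ids with no page, which is exactly how the real loop skips missing records:

```python
# Demonstrates the try/except/else skip pattern with a stand-in fetcher.
def fetch(i):
    # Stand-in for urllib2.urlopen: pretend even ids have no page.
    if i % 2 == 0:
        raise IOError("HTTP Error 404: Not Found")
    return "<fieldset id='ctl00_step2'>record %d</fieldset>" % i

found = []
for i in range(2500, 2510):
    try:
        page = fetch(i)
    except IOError:
        continue  # no page for this id, move on to the next one
    else:
        found.append(page)

print(len(found))  # only the ids that actually resolved (here, the 5 odd ones)
```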

Or use a (great) library such as requests and check the status code before scraping:

import requests
for i in xrange(2500,7000):
    page = requests.get("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    if not page.ok:
        continue
    soup = BeautifulSoup(page.text)
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

Basically, there is no way to know whether the page with a given id exists before requesting the URL.
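If you want to avoid downloading full bodies for missing ids, one cheaper option is a HEAD request via requests. This is only a sketch under an assumption: the server must answer HEAD correctly, which not every ASP.NET app does (some only answer GET, in which case fall back to `session.get`). The `session` parameter lets you reuse a `requests.Session` or substitute a stub:

```python
def page_exists(url, session=None):
    """Return True if the URL answers a HEAD request with status 200.

    A HEAD request fetches only the status line and headers, not the
    body, so it is cheap for probing many ids.
    """
    if session is None:
        import requests  # assumed available: pip install requests
        session = requests
    resp = session.head(url, allow_redirects=True, timeout=10)
    return resp.status_code == 200
```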




Try to find an index page on the site; otherwise, you simply can't tell whether a page exists before trying to reach its URL.

6 Comments

What does that have to do with this? I have a list of URLs, and I want to skip a URL if it does not exist. Sorry, but I don't get exactly what you mean.
Most websites have some way of looping (paging) over existing records (ids in your case), or some other way of reaching or searching them; otherwise those pages would not be accessible to their users. Most spiders/fetchers use those "meta" pages to cover the entire set: the first step runs over the index page, and the next step scrapes the pages it points to. Check out projects like scrapy.org, maybe even use one :) Sorry if I'm not following your intention...
Yes, I understand, but I don't think it is the same situation here, because I am able to access any particular URL for a given id, I guess.
I know you can :) I'm just saying you should not run a blind loop over all ids; users of the site would not browse like that. Let your spider use the site as users would: have it browse pages the way a potential user is expected to, investigate the site structure, and look for a pagination or browse page.
SO is funny sometimes; the above answer suggests hitting a few thousand 404s on the site. IMO this is bad for at least 10 different reasons.
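The crawl-the-index approach described in these comments can be sketched as follows. The HTML string is a hypothetical stand-in for whatever listing page the site actually offers (whether one exists must be checked by hand); the spider would then visit only the ids found there instead of blindly iterating 2500–7000:

```python
from bs4 import BeautifulSoup
import re

# Made-up stand-in for an index/listing page on the site;
# the real markup will differ and must be inspected first.
index_html = """
<ul>
  <li><a href="default.aspx?id=2513">Trainer A</a></li>
  <li><a href="default.aspx?id=2600">Trainer B</a></li>
  <li><a href="default.aspx?id=4591">Trainer C</a></li>
</ul>
"""

soup = BeautifulSoup(index_html, "html.parser")
# Collect only the ids the site actually links to.
ids = [re.search(r'id=(\d+)', a['href']).group(1)
       for a in soup.find_all('a', href=re.compile(r'id=\d+'))]
print(ids)
```

Each extracted id can then be fetched directly, with no wasted requests on ids that were never assigned.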
