
I'm following a Python tutorial on YouTube and got up to the part where we build a basic web crawler. I tried making my own to do a very simple task: go to my city's cars section on Craigslist, print the title/link of every entry, then jump to the next page and repeat as needed. It works for the first page, but it won't continue on to the next pages and get their data. Can someone help explain what's wrong?

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    page = 0 # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page) # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml') # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href') # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
            page += 100 # craigslist pages go 0, 100, 200, etc

widow(0) # 0 gets the first page, replace with multiples of 100 for extra pages

1 Answer


Looks like you have a problem with your indentation: you need to do page += 100 in the main while block, not inside the for loop. As written, page jumps by 100 for every link on the page, so by the end of the first page the counter has blown past max_pages and the while loop exits.

def widow(max_pages):
    page = 0 # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page) # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml') # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href') # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
        page += 100 # craigslist pages go 0, 100, 200, etc
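
As a side note: rather than guessing max_pages up front, you can stop as soon as a page comes back with no listings. Here's a minimal sketch of that idea (the function name widow_until_empty is just for illustration, and it assumes listing links still carry the 'hdrlnk' class from your question; Craigslist's markup may have changed):

import requests
from bs4 import BeautifulSoup

def widow_until_empty():
    page = 0
    while True:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        links = soup.findAll('a', {'class': 'hdrlnk'})  # assumes this class still marks listing links
        if not links:
            break  # an empty result page means we've run past the last page
        for link in links:
            print(link.string)
            print('http://orlando.craigslist.org' + link.get('href'))
        page += 100  # craigslist result pages advance in steps of 100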

3 Comments

Won't this be only part of the solution? page is being incremented, but max_pages is set to 0 in the example, so after the first page 100 <= 0 will return False and the loop will exit.
OP's comment suggests he will call widow(0) to get just the first page. If he calls widow(1000), it will keep scraping until page <= 1000 is no longer true.
SSNR is right. The indentation was the problem, and fixing it sorted everything out.
