
I'm following a Python tutorial on YouTube and got up to the part where we build a basic web crawler. I tried making my own to do a very simple task: go to my city's cars section on Craigslist, print the title/link of every entry, then jump to the next page and repeat as needed. It works for the first page, but it won't continue on to the next pages and get their data. Can someone help explain what's wrong?

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    page = 0 # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page) # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml') # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href') # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
            page += 100 # craigslist pages go 0, 100, 200, etc

widow(0) # 0 gets the first page, replace with multiples of 100 for extra pages

1 Answer


Looks like you have a problem with your indentation: you need to do page += 100 in the main while block, not inside the for loop. As written, page jumps by 100 for every link on the page, so by the end of the first page the counter has blown past max_pages and the while loop exits.

def widow(max_pages):
    page = 0 # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page) # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml') # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href') # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
        page += 100 # craigslist pages go 0, 100, 200, etc
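
As a side note: rather than guessing max_pages up front, you can stop as soon as a page comes back with no listings. Here's a minimal sketch of that idea (the function name widow_until_empty is just for illustration, and it assumes listing links still carry the 'hdrlnk' class from your question; Craigslist's markup may have changed):

import requests
from bs4 import BeautifulSoup

def widow_until_empty():
    page = 0
    while True:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        links = soup.findAll('a', {'class': 'hdrlnk'})  # assumes this class still marks listing links
        if not links:
            break  # an empty result page means we've run past the last page
        for link in links:
            print(link.string)
            print('http://orlando.craigslist.org' + link.get('href'))
        page += 100  # craigslist result pages advance in steps of 100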

3 Comments

Won't this be only part of the solution? page is being incremented, but max_pages is set to 0 in the example, so after the first page 100 <= 0 will return False and the loop will exit.
OP's comment suggests he will call widow(0) to get just the first page. If he calls widow(1000), it will keep scraping until page <= 1000 is no longer true.
SSNR is right. The indentation was the problem, and fixing it sorted everything out.
