
I'm having an issue writing a basic web crawler. I'd like to write about 500 pages of raw html to files. The problem is my search is either too broad or too narrow. It either goes too deep, and never gets past the first loop, or doesn't go deep enough, and returns nothing.

I've tried playing around with the limit= parameter in find_all(), but am not having any luck with that.

Any advice would be appreciated.

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()

            with open(str(page.replace("/","_"))+".txt","a+") as f:
                f.write(s)
                f.close()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True,limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')
  • "The problem is my search is either too broad or too narrow" -- can you elaborate, please? Commented Mar 5, 2015 at 1:41
  • Not a problem. I added this: It either goes too deep, and never gets past the first loop, or doesn't go deep enough, and returns nothing. Commented Mar 5, 2015 at 1:43
  • Might want to look into an already_crawled set (see the sketch after these comments). Commented Mar 5, 2015 at 2:05
  • Minor point in terms of the core functionality, but the with statement obviates the need to close the file explicitly. Commented Mar 5, 2015 at 2:09
  • @C.B. I implemented your suggestion, but I couldn't see your comment on the answer edit page -- just making sure credit is given. Commented Mar 5, 2015 at 3:32
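For reference, the already_crawled idea from the comments can be bolted onto the original loop without changing anything else. A minimal sketch, assuming the same bs4/urllib2 setup as the question and keeping the stated 500-page target as a cap:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url, max_pages=500):
    to_crawl = [seed_url]
    already_crawled = set()  # pages we have already fetched
    while to_crawl and len(already_crawled) < max_pages:
        page = to_crawl.pop()
        # skip non-absolute links and anything we've fetched before
        if not page.startswith("http") or page in already_crawled:
            continue
        already_crawled.add(page)
        s = urlopen(page).read()
        # write s out to a file here, as in the original code, if desired
        soup = BeautifulSoup(s)
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in already_crawled:
                to_crawl.append(a)

Nothing else about the original approach has to change; the set is what breaks the redirect loop.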

1 Answer


I modified your function so that it doesn't write to a file and just prints the URLs, and this is what I got:

http://www.nytimes.com/
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/

So it looks like your code basically works, but it gets stuck in a loop: the same few redirect/registration URLs keep getting re-appended and re-crawled because nothing tracks which pages have already been visited. Maybe try rewriting this as a recursive function that carries a set of already-crawled URLs, so each page is fetched at most once.

EDIT: here's a recursive function:

def recursive_crawler(url, crawled):
    # stop once 500 pages have been collected
    if len(crawled) >= 500:
        return
    print url
    page_source = urlopen(url)
    s = page_source.read()

    #write to file here, if desired
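    # for example (an assumption, reusing the filename scheme from the question;
    # the with block closes the file for you):
    # with open(url.replace("/", "_") + ".txt", "w") as f:
    #     f.write(s)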

    soup = BeautifulSoup(s)
    for link in soup.find_all('a', href=True):
        a = link['href']
        if a != url and a.startswith("http") and a not in crawled:
            crawled.add(a)
            recursive_crawler(a, crawled)

Pass it an empty set for crawled:

c = set()
recursive_crawler('http://www.nytimes.com', c)

output (I interrupted it after a few seconds):

http://www.nytimes.com
http://www.nytimes.com/content/help/site/ie8-support.html
http://international.nytimes.com
http://cn.nytimes.com
http://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
http://www.nytimes.com/video
http://www.nytimes.com/pages/world/index.html
http://www.nytimes.com/pages/national/index.html
http://www.nytimes.com/pages/politics/index.html
http://www.nytimes.com/pages/nyregion/index.html
http://www.nytimes.com/pages/business/index.html

Thanks to whoever it was who suggested using an already_crawled set.


3 Comments

  • This practically begs for a recursive function. And if you're only worried about 500 pages, you're not going to hit the max recursion depth (see the note on checking the limit after these comments).
  • I'm not sure I follow. What would the recursive function look like?
  • I'll write up a possible recursive function and edit it into the answer.
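On the recursion-depth point above: CPython's default limit is 1000 frames, so roughly 500 nested recursive_crawler calls should fit, and it is easy to check, or raise cautiously, if a deeper crawl is ever needed. A small sketch:

import sys

# CPython's default recursion limit; each nested recursive_crawler call uses one frame
print sys.getrecursionlimit()   # typically 1000

# only if a deeper crawl is ever needed -- raising this trades safety for depth
sys.setrecursionlimit(2000)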
