
I'm having an issue writing a basic web crawler. I'd like to write about 500 pages of raw html to files. The problem is my search is either too broad or too narrow. It either goes too deep, and never gets past the first loop, or doesn't go deep enough, and returns nothing.

I've tried playing around with the limit= parameter in find_all(), but am not having any luck with that.

Any advice would be appreciated.

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()

            with open(str(page.replace("/","_"))+".txt","a+") as f:
                f.write(s)
                f.close()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True,limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')
  • "The problem is my search is either too broad or too narrow" -- can you elaborate, please? Commented Mar 5, 2015 at 1:41
  • Not a problem. I added this: It either goes too deep, and never gets past the first loop, or doesn't go deep enough, and returns nothing. Commented Mar 5, 2015 at 1:43
  • Might want to look into an already_crawled set (see the sketch after these comments). Commented Mar 5, 2015 at 2:05
  • Minor point in terms of the core functionality, but the with statement obviates the need to close the file explicitly. Commented Mar 5, 2015 at 2:09
  • @C.B. I implemented your suggestion, but I couldn't see your comment on the answer edit page -- just making sure credit is given. Commented Mar 5, 2015 at 3:32
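For reference, the already_crawled idea from the comments can be bolted onto the original loop without changing anything else. A minimal sketch, assuming the same bs4/urllib2 setup as the question and keeping the stated 500-page target as a cap:

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url, max_pages=500):
    to_crawl = [seed_url]
    already_crawled = set()  # pages we have already fetched
    while to_crawl and len(already_crawled) < max_pages:
        page = to_crawl.pop()
        # skip non-absolute links and anything we've fetched before
        if not page.startswith("http") or page in already_crawled:
            continue
        already_crawled.add(page)
        s = urlopen(page).read()
        # write s out to a file here, as in the original code, if desired
        soup = BeautifulSoup(s)
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in already_crawled:
                to_crawl.append(a)

Nothing else about the original approach has to change; the set is what breaks the redirect loop.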

1 Answer


I modified your function so that it doesn't write to a file and just prints the URLs, and this is what I got:

http://www.nytimes.com/
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/
http://international.nytimes.com
http://cn.nytimes.com
http://cn.nytimes.com/register/?redirect_url=http://cn.nytimes.com/

So it looks like your code basically works, but it gets stuck in a loop: the same few redirect/registration URLs keep getting re-appended and re-crawled because nothing tracks which pages have already been visited. Maybe try rewriting this as a recursive function that carries a set of already-crawled URLs, so each page is fetched at most once.

EDIT: here's a recursive function:

def recursive_crawler(url, crawled):
    # stop once 500 pages have been collected
    if len(crawled) >= 500:
        return
    print url
    page_source = urlopen(url)
    s = page_source.read()

    #write to file here, if desired
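    # for example (an assumption, reusing the filename scheme from the question;
    # the with block closes the file for you):
    # with open(url.replace("/", "_") + ".txt", "w") as f:
    #     f.write(s)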

    soup = BeautifulSoup(s)
    for link in soup.find_all('a', href=True):
        a = link['href']
        if a != url and a.startswith("http") and a not in crawled:
            crawled.add(a)
            recursive_crawler(a, crawled)

Pass it an empty set for crawled:

c = set()
recursive_crawler('http://www.nytimes.com', c)

output (I interrupted it after a few seconds):

http://www.nytimes.com
http://www.nytimes.com/content/help/site/ie8-support.html
http://international.nytimes.com
http://cn.nytimes.com
http://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
http://www.nytimes.com/video
http://www.nytimes.com/pages/world/index.html
http://www.nytimes.com/pages/national/index.html
http://www.nytimes.com/pages/politics/index.html
http://www.nytimes.com/pages/nyregion/index.html
http://www.nytimes.com/pages/business/index.html

Thanks to whoever it was who suggested using an already_crawled set.


3 Comments

  • This practically begs for a recursive function. And if you're only worried about 500 pages, you're not going to hit the max recursion depth (see the note on checking the limit after these comments).
  • I'm not sure I follow. What would the recursive function look like?
  • I'll write up a possible recursive function and edit it into the answer.
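On the recursion-depth point above: CPython's default limit is 1000 frames, so roughly 500 nested recursive_crawler calls should fit, and it is easy to check, or raise cautiously, if a deeper crawl is ever needed. A small sketch:

import sys

# CPython's default recursion limit; each nested recursive_crawler call uses one frame
print sys.getrecursionlimit()   # typically 1000

# only if a deeper crawl is ever needed -- raising this trades safety for depth
sys.setrecursionlimit(2000)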
