I'm having an issue writing a basic web crawler. I'd like to write about 500 pages of raw HTML to files. The problem is that my crawl is either too broad or too narrow: it either goes too deep and never gets past the first loop, or it doesn't go deep enough and returns nothing.
I've tried playing around with the limit= parameter in find_all(), but am not having any luck with that.
Any advice would be appreciated.
from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url):
    to_crawl = [seed_url]
    while to_crawl:
        page = to_crawl.pop()
        if page.startswith("http"):
            page_source = urlopen(page)
            s = page_source.read()
            with open(str(page.replace("/", "_")) + ".txt", "a+") as f:
                f.write(s)
                f.close()
            soup = BeautifulSoup(s)
            for link in soup.find_all('a', href=True, limit=5):
                # print(link)
                a = link['href']
                if a.startswith("http"):
                    to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')
A couple of notes: keep the URLs you have already fetched in an already_crawled set so the crawler doesn't revisit (or loop over) the same pages, and the with statement obviates the need to close the file explicitly.
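For what it's worth, here is a minimal sketch of that approach. It keeps the urllib2/BeautifulSoup calls from the question; the already_crawled set, the max_pages cap (defaulting to 500), the collections.deque queue, and the try/except around urlopen are my own additions and are only meant to illustrate the idea:

from collections import deque

from bs4 import BeautifulSoup
from urllib2 import urlopen

def crawler(seed_url, max_pages=500):
    to_crawl = deque([seed_url])   # FIFO queue -> breadth-first crawl
    already_crawled = set()        # URLs fetched so far

    while to_crawl and len(already_crawled) < max_pages:
        page = to_crawl.popleft()
        if not page.startswith("http") or page in already_crawled:
            continue
        already_crawled.add(page)

        try:
            s = urlopen(page).read()
        except Exception:
            continue               # skip pages that fail to download

        # the with statement closes the file for us; no explicit close() needed
        with open(page.replace("/", "_") + ".txt", "w") as f:
            f.write(s)

        soup = BeautifulSoup(s, "html.parser")
        # no limit= here: the max_pages cap is what bounds the crawl
        for link in soup.find_all('a', href=True):
            a = link['href']
            if a.startswith("http") and a not in already_crawled:
                to_crawl.append(a)

if __name__ == "__main__":
    crawler('http://www.nytimes.com/')

Using popleft() on a deque makes the crawl breadth-first, so it fans out across the seed page's links instead of chasing one chain of links downward, and the already_crawled check together with the max_pages cap is what actually stops it at roughly 500 pages.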