
I wrote a simple crawler in Python. It seems to work and finds new links, but it keeps finding the same links over and over, and it does not download the newly found pages. It also seems to crawl infinitely, even after it reaches the crawling depth limit I set. I am not getting any errors; it just runs forever. Here is the code and a sample run. I am using Python 2.7 on Windows 7 64-bit.

import sys
import time
from bs4 import *
import urllib2
import re
from urlparse import urljoin

def crawl(url):
    url = url.strip()
    page_file_name = str(hash(url))
    page_file_name = page_file_name + ".html" 
    fh_page = open(page_file_name, "w")
    fh_urls = open("urls.txt", "a")
    fh_urls.write(url + "\n")
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    html_text = str(soup)
    fh_page.write(url + "\n")
    fh_page.write(page_file_name + "\n")
    fh_page.write(html_text)
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    rs = []
    for link in links:
        try:
            #r = urllib2.urlparse.urljoin(url, link)
            r = urllib2.urlopen(link)
            r_str = str(r.geturl())
            fh_urls.write(r_str + "\n")
            #a = urllib2.urlopen(r)
            if r.headers['content-type'] == "html" and r.getcode() == 200:
                rs.append(r)
                print "Extracted link:"
                print link
                print "Extracted link final URL:"
                print r
        except urllib2.HTTPError as e:
            print "There is an error crawling links in this page:"
            print "Error Code:"
            print e.code
    return rs
    fh_page.close()
    fh_urls.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: python crawl.py <seed_url> <crawling_depth>"
        print "e.g: python crawl.py https://www.yahoo.com/ 5"
        exit()
    url = sys.argv[1]
    depth = sys.argv[2]
    print "Entered URL:"
    print url
    html_page = urllib2.urlopen(url)
    print "Final URL:"
    print html_page.geturl()
    print "*******************"
    url_list = [url, ]
    current_depth = 0
    while current_depth < depth:
        for link in url_list:
            new_links = crawl(link)
            for new_link in new_links:
                if new_link not in url_list:
                    url_list.append(new_link)
            time.sleep(5)
            current_depth += 1
            print current_depth

Here is what I got when I ran it:

C:\Users\Hussam-Den\Desktop>python test.py https://www.yahoo.com/ 4
Entered URL:
https://www.yahoo.com/
Final URL:
https://www.yahoo.com/
*******************
1

And here is the output file that stores the crawled URLs:

https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
https://www.yahoo.com/
https://www.yahoo.com/lifestyle/horoscope/libra/daily-20170924.html
https://policies.yahoo.com/us/en/yahoo/terms/utos/index.htm
https://policies.yahoo.com/us/en/yahoo/privacy/adinfo/index.htm
https://www.oath.com/careers/work-at-oath/
https://help.yahoo.com/kb/account
(... the same six URLs repeat for several more cycles ...)

Any idea what's wrong?

2 Comments
  • This code isn't properly indented. Commented Sep 24, 2017 at 16:12
  • There are a lot of indentation errors. Can you fix them and re-upload for troubleshooting? Commented Sep 24, 2017 at 16:18

1 Answer

  1. You have an error here: depth = sys.argv[2]. sys.argv holds strings, not ints. You should write depth = int(sys.argv[2]).
  2. Because of point 1, the condition while current_depth < depth: always evaluates to True (in Python 2, comparing an int against a str does not raise an error; the int always compares as smaller).

Try to fix it by converting argv[2] to int. I think the error is there.
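A minimal sketch of the fix (the parse_depth helper name is mine, just for illustration):

```python
def parse_depth(argv):
    # argv entries are always strings, so "4" must be converted
    # before it can be compared against an int counter.
    return int(argv[2])

depth = parse_depth(["crawl.py", "https://www.yahoo.com/", "4"])
current_depth = 0
while current_depth < depth:   # now an int-vs-int comparison
    current_depth += 1
print(current_depth)  # prints 4, then the loop has terminated
```

Without the int() conversion, Python 2 silently loops forever because 0 < "4" is True there (mismatched types are ordered arbitrarily but consistently), whereas Python 3 raises a TypeError and makes the bug obvious.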


3 Comments

@hussam-hallak The answer above is correct. I'd recommend looking at Python's argparse module, which does this sort of thing for you: you'd declare max_depth as an int and it would handle the conversion. Very useful module.
Much more important: Switch to Python 3. Among other things it flags int-str comparisons as errors, so this problem would have been obvious. And tomorrow you'll be trying to scrape websites with different encodings, and pulling your hair out trying to navigate the Python 2 approach to encodings. Switch today!
@alexis, yes, Python 3 is a good choice. I don't understand people who start new projects on Py2 :)
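The argparse suggestion above can be sketched like this (argument names are made up to match the usage string in the question); type=int both converts and validates:

```python
import argparse

parser = argparse.ArgumentParser(description="Simple crawler")
parser.add_argument("seed_url", help="URL to start crawling from")
# type=int converts the string for you and rejects non-numeric
# input with a readable error instead of a silent infinite loop.
parser.add_argument("crawling_depth", type=int, help="maximum crawl depth")

args = parser.parse_args(["https://www.yahoo.com/", "5"])
print(args.crawling_depth)  # prints 5, already an int
```

In a real script you would call parser.parse_args() with no arguments so it reads sys.argv; the explicit list here just makes the example self-contained.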
