
I am making a web crawler. I'm not using Scrapy or anything; I'm trying to have my script do most things itself. I have searched for the issue but can't find anything that helps with the error. I've tried switching around some of the variables to try to narrow down the problem. I am getting an error on line 24 saying IndexError: string index out of range. The functions run on the first URL (the original URL), then the second, and fail on the third in the original array. I'm lost; any help would be greatly appreciated! Note: I'm only printing everything for testing; eventually I'll print the results to a text file.

import requests
from bs4 import BeautifulSoup

# creating requests from user input
url = raw_input("Please enter a domain to crawl, without the 'http://www' part : ")

def makeRequest(url):
    r = requests.get('http://' + url)
    # Adding in BS4 for finding a tags in HTML
    soup = BeautifulSoup(r.content, 'html.parser')
    # Writes a as the link found in the href
    output = soup.find_all('a')
    return output


def makeFilter(link):
    # Creating array for our links
    found_link = []
    for a in link:
        a = a.get('href')
        a_string = str(a)

        # if statement to filter our links
        if a_string[0] == '/': # this is the line with the error
            # Relative links
            found_link.append(a_string)

        if 'http://' + url in a_string:
            # Links from the same site
            found_link.append(a_string)

        if 'https://' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)

        if 'http://www.' + url in a_string:
            # Links from the same site
            found_link.append(a_string)

        if 'https://www.' + url in a_string:
            # Links from the same site with SSL
            found_link.append(a_string)
        #else:  
        #   found_link.write(a_string + '\n') # testing only
    output = found_link

    return output   

# Function for removing duplicates
def remove_duplicates(values):
    output = []
    seen = set()
    for value in values:
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# Run the function with our list in this order -> Makes the request -> Filters the links -> Removes duplicates
def createURLList(values):
    requests = makeRequest(values)
    new_list = makeFilter(requests)
    filtered_list = remove_duplicates(new_list)

    return filtered_list

result = createURLList(url)

# print result

# for verifying and crawling resulting pages
for b in result:
    sub_directories = createURLList(url + b)
    crawler = []
    crawler.append(sub_directories)

    print crawler
  • Have you printed a_string? It's an empty string. Commented Jan 4, 2017 at 1:23
  • I did try that, but I get the same thing: it prints three strings and then errors out with the same error. Commented Jan 4, 2017 at 1:28
  • Unless I'm missing something, you only have print crawler in your code. Try print "this is string_a", a right underneath a_string = str(a). It is almost certainly blank. Commented Jan 4, 2017 at 1:32
  • OK I did that, although it's not empty I get the print out of all the links that it found. Commented Jan 4, 2017 at 1:41
  • I honestly don't know of any other way to get IndexError: string index out of range unless you had this is string a and nothing after it printed out immediately before the error. You could try import sys and do sys.stdout.flush() right after the print statement and before if a_string[0] == '/' to check you see the last print statement properly. Commented Jan 4, 2017 at 1:45
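(For reference, the diagnosis in the comments above is exactly right: indexing an empty string is what raises this error. A minimal standalone demonstration, not part of the asker's code:)

```python
# An <a href=""> tag yields an empty href string; indexing it fails.
# (str(None) would give "None", which indexes fine, so only a truly
# empty string triggers this.)
a_string = ""
try:
    a_string[0]
except IndexError as e:
    print("IndexError:", e)  # prints: IndexError: string index out of range
```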

1 Answer


After a_string = str(a) try adding:

if not a_string:
  continue
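(Put together, the filter loop with that guard might look like the sketch below. The names `make_filter`/`links` are illustrative, not the asker's exact code, and the prefix checks are collapsed into one condition for brevity.)

```python
def make_filter(links, url):
    """Keep relative links and links on the same site, skipping empty hrefs."""
    found = []
    for a in links:
        a_string = str(a)
        if not a_string:          # guard: <a href=""> gives an empty string
            continue
        if a_string[0] == '/' or url in a_string:
            found.append(a_string)
    return found

print(make_filter(['', '/about', 'http://example.com/x'], 'example.com'))
# prints: ['/about', 'http://example.com/x']
```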

6 Comments

This actually looked like it may have cleared the issue, it still fails about eighteen links down but the error is different. Thanks! Would that mean that it is an empty string, somewhere?
Yeah, it means a_string is falsy ("", None, False, 0, etc), but almost certainly an empty string.
I'm sorry if this is a stupid question, but would it be better if I didn't have the str() call? Or should I keep digging?
@GeorgeOffley It shouldn't really matter. Where/what is the error now?
This is the new error. As I said, the script runs for about eighteen of the links. requests.exceptions.ConnectionError: HTTPConnectionPool(host='python.orghttp', port=80): Max retries exceeded with url: //python.org/dev/peps/ (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f297330f110>: Failed to establish a new connection: [Errno -2] Name or service not known',))
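(That new error comes from concatenating the base URL with an absolute link: `url + b` turns `python.org` + `http://python.org/dev/peps/` into the bogus host `python.orghttp`. A sketch of the usual fix with the standard library's urljoin, which is not in the original code:)

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

base = 'http://python.org/'
# Relative links resolve against the base; absolute links pass through
# unchanged, so the host never becomes 'python.orghttp'.
print(urljoin(base, '/dev/peps/'))                 # http://python.org/dev/peps/
print(urljoin(base, 'http://python.org/about/'))   # http://python.org/about/
```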
