I have this code that loops over the Alexa top 1000 websites and collects the ones that offer sign-up or login in any form. If a website gets stuck or throws an exception of any kind during an iteration, I remove it from my list and continue with the next element. I am using the selenium package in Python to do this.

It works fine, except that for some reason it loops over every other element of my alexa_1000 list (i.e. it skips one element each time) rather than going over each element. I can't see anything wrong with the code, and I have been stepping through it to follow the program flow, but I really can't figure out what's happening. The general flow of the program seems fine. When I print the index on each iteration to see the nature of the skipping, that also looks fine, in the sense that it goes from 0 to 1 to 2 to 3. Could anyone please help? Here's the code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():

    sites = []

    with open('topsites_1000.txt', 'r') as f:

        for line in f:
            line = line.strip('\n')
            sites.append(line)

    sites = filter(None, sites)


    sites = ['http://www.' + site for site in sites]

    return sites

def main():

    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)

    for index, page in enumerate(alexa_1000):


        try:

            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')

            alexa_1000.remove(page)

        except TimeoutException as ex:
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue    


    out.close()

if __name__ == "__main__":
    main()
  • Removing an element while enumerating a collection will interfere with the enumeration. Commented Feb 4, 2018 at 1:33
  • @Jerry101 I added enumerate only later to test it out, otherwise the usual for page in alexa_1000 was having the same effect. Commented Feb 4, 2018 at 1:45
  • Yes, I meant "enumerating" in the sense of looping over the values. This common gotcha arises because the loop remembers to provide the nth item next, and deleting an earlier item shifts the later items down, so the item that was at collection[n] moves to collection[n-1]. One fix is to loop over the collection backwards: for page in alexa_1000[::-1]: so that shifting later items down won't interfere. Another fix is to make a temporary copy, as explained in "The for statement". Commented Feb 4, 2018 at 3:43
  • @Jerry101 Thanks so much for that explanation! :) Learnt something awesome. Commented Feb 4, 2018 at 17:56
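
The skipping described in the comments above can be reproduced without Selenium. The following is only a minimal sketch: the short list of letters is a hypothetical stand-in for the real URL list, and the print() calls are written so that they run under both Python 2 and Python 3.

```python
# Hypothetical stand-in data; the real list holds the Alexa URLs.
sites = ['a', 'b', 'c', 'd', 'e', 'f']
visited = []

for index, page in enumerate(sites):
    visited.append(page)
    # Removing the current item shifts every later item one slot to the
    # left, so the item that would have been next is never visited.
    sites.remove(page)

print(visited)  # ['a', 'c', 'e']
print(sites)    # ['b', 'd', 'f']
```

The index printed inside such a loop still counts 0, 1, 2, which is why printing it does not reveal the problem.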

1 Answer

There are several ways to get rid of the issue. It occurs because you are modifying a collection while iterating over it, which should always be avoided. You can avoid it by rewriting your code in one of the following ways.

Using sets instead of a list

from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():

    sites = []

    with open('topsites_1000.txt', 'r') as f:

        for line in f:
            line = line.strip('\n')
            sites.append(line)

    sites = filter(None, sites)


    sites = ['http://www.' + site for site in sites]

    return sites

def main():

    alexa_1000 = set(get_alexa_top_pages())

    alexa_invalid = set()

    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)

    for index, page in enumerate(alexa_1000):


        try:

            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')

        except TimeoutException as ex:
            alexa_invalid.add(page)
            continue
        except Exception as ex:
            alexa_invalid.add(page)
            continue    

    alexa_valid = alexa_1000 - alexa_invalid

    out.close()

if __name__ == "__main__":
    main()

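Here you use two sets: one to loop over and one to keep track of the invalid sites. If an exception occurs, you add the page to the invalid set. At the end you can subtract the invalid set from the full set to get the valid sites as well. Note that a set does not preserve the order of the original list.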
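
The same bookkeeping can be tried in isolation. This is a rough sketch only: check_page is a hypothetical stand-in for the Selenium page load, and the URLs are made up for illustration.

```python
def check_page(page):
    """Hypothetical stand-in for driver.get(); fails for 'bad' pages."""
    if 'bad' in page:
        raise RuntimeError('could not load %s' % page)


# Made-up URLs, used only for illustration.
all_sites = set(['http://www.ok1.com', 'http://www.bad1.com',
                 'http://www.ok2.com'])
invalid_sites = set()

for page in all_sites:
    try:
        check_page(page)
    except Exception:
        # Record the failure; the set being looped over is left untouched.
        invalid_sites.add(page)

# Set difference: everything that did not fail.
valid_sites = all_sites - invalid_sites
print(sorted(valid_sites))
```

Because nothing is removed from all_sites during the loop, every page is visited exactly once.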

Looping over a reversed copy of the list

from selenium import webdriver
from selenium.common.exceptions import TimeoutException


def get_alexa_top_pages():

    sites = []

    with open('topsites_1000.txt', 'r') as f:

        for line in f:
            line = line.strip('\n')
            sites.append(line)

    sites = filter(None, sites)


    sites = ['http://www.' + site for site in sites]

    return sites

def main():

    alexa_1000 = get_alexa_top_pages()

    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']

    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)

    for index, page in enumerate(alexa_1000[::-1]):


        try:

            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')

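        except TimeoutException as ex:
            # Safe to modify alexa_1000 here: the loop iterates over the
            # reversed copy created by alexa_1000[::-1], not the list itself.
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue

    out.close()

if __name__ == "__main__":
    main()

Here you loop over a reversed copy of the list (alexa_1000[::-1]) and remove the pages that error out from the original list. Because the loop runs over the copy, removing items from the original does not disturb the iteration. At the end alexa_1000 will contain all the valid websites that you processed.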
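
A minimal sketch of the loop-over-a-copy idea, again with a hypothetical check_page helper and made-up URLs in place of the Selenium calls. It uses remove(page) so that the failing page itself is deleted from the original list.

```python
def check_page(page):
    """Hypothetical stand-in for driver.get(); fails for 'bad' pages."""
    if 'bad' in page:
        raise RuntimeError('could not load %s' % page)


# Made-up URLs, used only for illustration.
sites = ['http://www.ok1.com', 'http://www.bad1.com',
         'http://www.ok2.com', 'http://www.bad2.com']

for page in sites[::-1]:      # sites[::-1] builds a reversed copy
    try:
        check_page(page)
    except Exception:
        sites.remove(page)    # the loop walks the copy, so this is safe

print(sites)  # ['http://www.ok1.com', 'http://www.ok2.com']
```

The surviving pages keep their original relative order, which a set would not guarantee.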

There are many other ways to approach this; the two above are just ones that you can readily use.
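
As one illustration of another approach, you can build a new list instead of deleting from the old one. This is only a sketch: page_loads is a hypothetical helper that would wrap the try/except around the Selenium call and return False on failure, and the URLs are made up.

```python
def page_loads(page):
    """Hypothetical helper: True if the page loaded, False otherwise."""
    return 'bad' not in page


# Made-up URLs, used only for illustration.
sites = ['http://www.ok1.com', 'http://www.bad1.com', 'http://www.ok2.com']

# The original list is never modified, so no element can be skipped.
valid_sites = [page for page in sites if page_loads(page)]

print(valid_sites)  # ['http://www.ok1.com', 'http://www.ok2.com']
```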

1 Comment

Thanks so much for this answer! I appreciate it a lot. The reverse array pop example is pretty cool.
