I have this code that loops over the Alexa Top 1000 websites and collects the ones that allow sign-up or login in any form. If a website gets stuck or throws an exception of any kind during an iteration, I remove it from my list and continue the loop with the next element. I am using the selenium package in Python to do this. It works fine, except that for some reason it's looping over every other element of my alexa_1000 list (i.e. skipping one element each time) rather than going over each element. Could anyone please help? I can't see anything wrong with the code, and I have been debugging it to follow the program flow, but I really can't figure out what's happening. The general flow of the program seems to be fine. When I print the index on each iteration to see the nature of the skipping, that also seems fine, in the sense that it goes from 0 to 1 to 2 to 3. I would be glad of any help. Here's the code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def get_alexa_top_pages():
    sites = []
    with open('topsites_1000.txt', 'r') as f:
        for line in f:
            line = line.strip('\n')
            sites.append(line)
    sites = filter(None, sites)
    sites = ['http://www.' + site for site in sites]
    return sites

def main():
    alexa_1000 = get_alexa_top_pages()
    out = open('sites_with_login.txt', 'w')
    sign_in_strings = ['sign in', 'signin', 'login', 'log in', 'sign up', 'signup']
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    for index, page in enumerate(alexa_1000):
        try:
            print "Loading page %s (num %d)" %(page, index + 1)
            driver.get(page)
            html_source = driver.page_source
            html_source = html_source.lower()
            present = any([i in html_source for i in sign_in_strings])
            if present:
                out.write(page + '\n')
            alexa_1000.remove(page)
        except TimeoutException as ex:
            alexa_1000.remove(page)
            continue
        except Exception as ex:
            alexa_1000.remove(page)
            continue
    out.close()

if __name__ == "__main__":
    main()
A plain for page in alexa_1000 was having the same effect.

The for loop keeps an internal index into the list: after visiting the item at position n-1 it fetches the nth item next, and deleting an earlier item shifts the later items downwards, so the item that was at collection[n] has moved to collection[n-1] and is skipped. One fix is to loop over the collection backwards: for page in alexa_1000[::-1]: so shifting later items downwards won't interfere. Another fix is to make a temporary copy, as explained in The for statement.
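To make the skipping concrete, here is a minimal sketch that reproduces it without selenium. The helper names and the five-item site list are made up for illustration; only the remove-inside-the-loop pattern is taken from the question.

```python
def remove_while_iterating(items):
    """Buggy pattern: remove from the same list that is being iterated."""
    visited = []
    for item in items:
        visited.append(item)
        # Removing shifts every later item down by one position,
        # so the loop's internal index now points past the next item.
        items.remove(item)
    return visited


def remove_while_iterating_copy(items):
    """Temporary-copy fix: iterate over items[:] and remove from the original."""
    visited = []
    for item in items[:]:
        visited.append(item)
        items.remove(item)
    return visited


# Hypothetical stand-in for the alexa_1000 list.
sites = ['a.com', 'b.com', 'c.com', 'd.com', 'e.com']

skipped = remove_while_iterating(list(sites))
complete = remove_while_iterating_copy(list(sites))

print(skipped)   # ['a.com', 'c.com', 'e.com']
print(complete)  # ['a.com', 'b.com', 'c.com', 'd.com', 'e.com']
```

In the question's loop the same change would be iterating over alexa_1000[:] (or alexa_1000[::-1]) while still calling alexa_1000.remove(page). Both slices create a new list, so removals from the original cannot disturb the iteration.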