
When I run the script below, it crawls the first link successfully and fetches the title and description, but when it does the same for the next link I encounter a stale element reference error on this line: data = [urljoin(link,item.get_attribute("href"))---. How can I complete the operation without this error?

This is the script:

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "http://urbantoronto.ca/database/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

for items in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#project_list table tr[id^='project']"))):
    data = [urljoin(link,item.get_attribute("href")) for item in items.find_elements_by_css_selector("a[href^='//urbantoronto']")]

    #I get stale "element reference" error exactly here pointing the above line

    for nlink in data:
        driver.get(nlink)
        sitem = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title")))
        title = sitem.text
        try:
            desc = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".project-description p"))).text
        except Exception: desc = ""
        print("Title: {}\nDescription: {}\n".format(title,desc))

driver.quit()
  • I guess this is because you define sitem after driver.get(nlink). You should do this before navigating to the next page, and also put wait.until(EC.staleness_of(sitem)) just after driver.get(nlink); a sketch of that pattern follows below. Commented Jun 12, 2018 at 7:37
  • See the edit sir. Commented Jun 12, 2018 at 11:20
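For reference, the staleness_of pattern suggested in the first comment looks roughly like this. It's a minimal sketch, not a drop-in fix: next_url is a placeholder, the h1.title selector is borrowed from the script purely as an example, and it assumes the same driver, wait, and imports as above.

old_element = driver.find_element(By.CSS_SELECTOR, "h1.title")  # reference tied to the current page
driver.get(next_url)                                            # navigation replaces the DOM
wait.until(EC.staleness_of(old_element))                        # blocks until the old reference is detached
new_element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title")))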

1 Answer

The real problem is your outer loop. The 'items' that you're iterating over all go stale as soon as you change pages, i.e., driver.get(nlink). That's why you were getting the StaleElementReferenceException on the second time through the loop at items.find_elements... The reason it would time out waiting on 'sitem' is that elements only go stale when the DOM changes. If the DOM doesn't change, well, you could be waiting a while for a stale element.
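To make that concrete, here is a minimal sketch (same page and selectors as the question, assuming the driver and wait already set up above) showing that any reference captured before a navigation is dead right after it:

from selenium.common.exceptions import StaleElementReferenceException

rows = driver.find_elements(By.CSS_SELECTOR, "#project_list table tr[id^='project']")
first_href = rows[0].find_element(By.CSS_SELECTOR, "a").get_attribute("href")
driver.get(first_href)  # the DOM those rows belonged to is gone now
try:
    rows[1].find_elements(By.CSS_SELECTOR, "a")  # touching a stale reference
except StaleElementReferenceException:
    print("rows captured before driver.get() are stale")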

With that in mind, I suggest a slightly different approach using BeautifulSoup. Selenium is great for JavaScript execution and all, but a little slow when it comes to parsing HTML, which is what you're doing for all of those table rows. So I suggest the following changes:

from urllib.parse import urljoin
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from bs4 import BeautifulSoup as bs
import re

link = "http://urbantoronto.ca/database/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

# For readability
by_selector = (By.CSS_SELECTOR, "#project_list table tr[id^='project']")
wait.until(EC.presence_of_all_elements_located(by_selector))

# Get HTML content
soup = bs(driver.page_source, 'lxml')

# Find div containing project table
table = soup.find('div', {'id': 'project_list'})

# Find all the project rows
projects = table.find_all('tr', {'id': re.compile(r'^project\d+')})

# Create page links
links = ['http:' + x.find('a')['href'] for x in projects]

for nlink in links:

    driver.get(nlink)
    sitem = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title")))
    title = sitem.text
    try:
        desc = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".project-description p"))).text
    except Exception: desc = ""
    print("Title: {}\nDescription: {}\n".format(title,desc))

driver.quit()

EDIT: Here is a pure selenium solution:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "http://urbantoronto.ca/database/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 10)

# For readability
condition = (By.CSS_SELECTOR, "#project_list table tr[id^='project']")
tr = wait.until(EC.presence_of_all_elements_located(condition))

# Get the links; this will take a few seconds with Selenium
selector = "a[href^='//urbantoronto']"
links = [x.find_element_by_css_selector(selector).get_attribute('href') for x in tr]

for nlink in links:

    driver.get(nlink)
    sitem = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title")))
    title = sitem.text
    try:
        desc = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".project-description p"))).text
    except Exception: desc = ""
    print("Title: {}\nDescription: {}\n".format(title,desc))

driver.quit()

And just to be clear, you need to extract the URLs before the loop to avoid the stale element issue you were having.


3 Comments

It does the job @T. Ray. +1 for this. However, I'd still love to have a pure selenium solution.
Then pure selenium you shall have. See my updated answer.
I wish I could upvote your solution several times. You made it. So, my nested loop was the culprit!! Thanks a lot.
