I'm trying to scrape Merriam-Webster's Medical Dictionary for medical terms using Python and Chrome as the Selenium webdriver. So far, this is what I have:

    from os import path
    from selenium import webdriver

    # Adding an ad-blocker to Chrome to speed up page load times
    options = webdriver.ChromeOptions()
    options.add_extension(path.abspath("ublock-origin.crx"))

    # Declaring the Selenium webdriver
    driver = webdriver.Chrome(chrome_options = options)

    # Fetching the "A" terms as a test set
    driver.get("https://www.merriam-webster.com/browse/medical/a")

    scraped_words = []  # The list that will hold each word
    page_num = 1
    while page_num < 55:  # There are 54 pages of "A" terms
        try:
            for i in range(4):  # There are 3 columns per page of words
                column = "/html/body/div/div/div[5]/div[2]/div[1]/div/div[3]/ul/li[" + str(i) + "]/a"
                number_of_words = len(driver.find_elements_by_xpath(column))
                for j in range(number_of_words):
                    word = driver.find_elements_by_xpath(column + "[" + str(j) + "]")
                    scraped_words.append(word)
            driver.find_element_by_class_name("fa-angle-right").click()  # Next page
            page_num += 1  # Increment page number to keep track of current page
        except:
            driver.close()

    # Write out words to a file
    with open("medical_terms.dict", "w") as text_file:
        for i in range(len(scraped_words)):
            text_file.write(str(scraped_words[i]))
            text_file.write("\n")

    driver.close()

The above code fetches all the items, as the output of len(scraped_words) is the number expected. However, since I did not specify that I wanted to fetch the text of the elements, I get element identifiers (I think?) instead of text. If I decide to use word = driver.find_elements_by_xpath(column + "[" + str(j) + "]").text in order to specify that I want to get the text of the element, I get the following error:

    Traceback (most recent call last):
      File "mw_download.py", line 20, in <module>
        number_of_words = len(driver.find_elements_by_xpath(column))
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 325, in find_elements_by_xpath
        return self.find_elements(by=By.XPATH, value=xpath)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 817, in find_elements
        'value': value})['value']
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: no such session
      (Driver info: chromedriver=2.31.488774 (7e15618d1bf16df8bf0ecf2914ed1964a387ba0b),platform=Mac OS X 10.12.6 x86_64)


    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "mw_download.py", line 27, in <module>
        driver.close()
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 541, in close
        self.execute(Command.CLOSE)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
        self.error_handler.check_response(response)
      File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.WebDriverException: Message: no such session
      (Driver info: chromedriver=2.31.488774 (7e15618d1bf16df8bf0ecf2914ed1964a387ba0b),platform=Mac OS X 10.12.6 x86_64)

What is strange to me is that the only code I change between runs is on line 22, yet the error message points to line 20 instead.

Any help in deciphering what's going on here and what I can do to fix it would be much appreciated! :+)

1 Answer

You just need to build a list of your elements' text. Change:

    word = driver.find_elements_by_xpath(column + "[" + str(j) + "]")

to:

    word = [i.text for i in driver.find_elements_by_xpath(column + "[" + str(j) + "]")]
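As a minimal sketch of the same idea, the snippet below collects the text of every matched element into one flat list and skips blanks. `FakeElement` and `collect_texts` are hypothetical names used only for illustration. `FakeElement` stands in for a Selenium element so the example runs without a browser.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium element; only .text is modelled."""

    def __init__(self, text):
        self.text = text


def collect_texts(elements):
    """Return the stripped, non-empty .text of each element as a list."""
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:  # skip elements whose text is empty or whitespace only
            texts.append(text)
    return texts


# In the real script this list would come from driver.find_elements_by_xpath(...)
found = [FakeElement("abdomen"), FakeElement("  "), FakeElement("abduct")]

scraped_words = []
scraped_words.extend(collect_texts(found))  # extend keeps the list flat
print(scraped_words)
```

Using `extend` rather than `append` avoids ending up with a list of lists, so each entry can be written to the output file as a plain string.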

Because .find_elements_by_xpath always returns a list, calling .text on the result directly raises an AttributeError; .text has to be read from each element in the list.
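This is also likely why the traceback points at line 20 rather than the line you edited. Calling `.text` on a list raises an `AttributeError`, the bare `except:` swallows it and closes the driver, and because the `while` loop keeps running, the next call on the closed driver fails with "no such session". The sketch below reproduces that sequence without a browser. `FakeDriver` and `NoSuchSession` are hypothetical stand-ins, not Selenium classes.

```python
class NoSuchSession(Exception):
    """Hypothetical stand-in for Selenium's 'no such session' error."""


class FakeDriver:
    """Hypothetical stand-in for a webdriver; it only tracks whether it is open."""

    def __init__(self):
        self.is_open = True

    def find_elements(self):
        if not self.is_open:
            raise NoSuchSession("no such session")
        return []  # like find_elements_by_xpath, this always returns a list

    def close(self):
        if not self.is_open:
            raise NoSuchSession("no such session")
        self.is_open = False


driver = FakeDriver()
events = []  # the exceptions that actually occurred, in order

for attempt in range(2):  # the original while loop keeps running after a failure
    try:
        elements = driver.find_elements()
        elements.text  # a list has no .text attribute
    except Exception as error:
        events.append(type(error).__name__)  # record the real cause
        if driver.is_open:
            driver.close()  # the original except block closed the driver here

print(events)
```

The first recorded error is the `AttributeError`; the session error only appears on the following pass. Catching a specific exception, or at least printing it and breaking out of the loop, would surface the real cause directly.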


15 Comments

Great explanation; thank you for helping me understand what was going wrong! :+)
The website is not responding well to my queries now, so I can't test it =/ I'd say the above should work. Also, note that you can just delete empty lists in post-processing.
@paanvaannd word is a list; try applying your changes to word[0].
Thanks! I had noticed my mistake a couple hours ago and it's working for the most part now! I still can't seem to remove the blank lines for some reason but I'll figure it out eventually. Thanks for all the help, and I hope you have a great week! :+)
Got it to work now; I wasn't including that line where I should have been. Thanks for all your help, again! :+)