I'm trying to crawl bloomberg.com and find links for all English news articles. The problem with the code below is that it does find a lot of articles from the first page, but then it gets stuck in a loop and only occasionally returns anything.

from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

visited = set()
to_crawl = deque()
to_crawl.append("https://www.bloomberg.com")

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    for elem in elems:
        #retrieve all href links and save it to url_element variable
        url_element = elem.get_attribute("href")
        if url_element not in visited:
            to_crawl.append(url_element)
            visited.add(url_element)
            #save news articles
            if 'www.bloomberg.com/news/articles' in url_element:
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    browser.close()

while len(to_crawl):
    url_to_crawl = to_crawl.pop()
    crawl_link(url_to_crawl)

I've tried using a queue and then a stack, but the behavior is the same. I can't seem to accomplish what I'm looking for.

How do you crawl websites like this to crawl news urls?

1 Answer

The approach you are using should work fine; however, after running it myself I noticed a few things that cause it to hang or throw errors.

I made some adjustments and included some in-line comments to explain my reasons.

from collections import deque
from selenium.common.exceptions import StaleElementReferenceException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

base = "https://www.bloomberg.com"
article = base + "/news/articles"
visited = set()


# A set discards duplicates automatically and is more efficient for lookups
articles = set()

to_crawl = deque()
to_crawl.append(base)

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    print(input_url)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")

    # this part was the issue: the original code called `visited.add()`
    # as soon as a link was discovered, so queued links looked
    # already-visited and were skipped without ever being crawled.
    # Only mark a URL visited once it has actually been loaded.
    visited.add(input_url)

    for elem in elems:

        # checks for errors
        try:
            url_element = elem.get_attribute("href")
        except StaleElementReferenceException as err:
            print(err)
            continue

        # checks to make sure links aren't being crawled more than once
        # and that all the links are in the proper domain
        if base in url_element and all(url_element not in i for i in [visited, to_crawl]):

            to_crawl.append(url_element)

            # this checks if the link matches the correct url pattern
            if article in url_element and url_element not in articles:

                articles.add(url_element)
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    
    browser.quit() # guarantees the browser closes completely


while len(to_crawl):
    # popleft makes the deque a FIFO instead of LIFO.
    # A queue would achieve the same thing.
    url_to_crawl = to_crawl.popleft()

    crawl_link(url_to_crawl)
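The FIFO-vs-LIFO distinction in the comment above can be seen with a toy frontier (the page names are made up for illustration):

```python
from collections import deque

# Hypothetical frontier of three discovered links
frontier = ["page1", "page2", "page3"]

# popleft() consumes in insertion order (FIFO -> breadth-first crawl)
bfs_order = []
q = deque(frontier)
while q:
    bfs_order.append(q.popleft())

# pop() consumes newest-first (LIFO -> depth-first crawl)
dfs_order = []
s = deque(frontier)
while s:
    dfs_order.append(s.pop())

print(bfs_order)  # ['page1', 'page2', 'page3']
print(dfs_order)  # ['page3', 'page2', 'page1']
```

With `pop()`, the crawler keeps chasing the most recently discovered link deeper and deeper, which is one way a crawl can appear to wander off and never come back; `popleft()` works through each page's links before moving on.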

After running for 60+ seconds, this was the output in result.txt: https://gist.github.com/alexpdev/b7545970c4e3002b1372e26651301a23
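One further refinement (not part of the code above, just a suggestion): sites often serve the same article under several URLs that differ only in tracking parameters or fragments, which inflates the frontier and defeats the dedup check. A minimal normalization helper, sketched with `urllib.parse`, could map them to one key:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Drop the query string and fragment, and any trailing slash,
    so one article maps to one key in the visited set."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path.rstrip("/"), "", ""))

print(normalize("https://www.bloomberg.com/news/articles/abc?srnd=home#top"))
# https://www.bloomberg.com/news/articles/abc
```

You would then call `normalize(url_element)` before the `visited`/`to_crawl` membership checks.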


4 Comments

After running this I get https://www.bloomberg.com, https://www.bloomberg.com/feedback, https://www.bloomberg.com/notices/tos. Do you know why? I am not getting the same result you're getting for some reason
Yes! I do see the results in the txt file. However, it's only 464 lines of links, and it just stops there. I'm not sure of the best way to keep crawling links.
No, because the output only has today's and yesterday's dates, so for some reason we're not able to visit other article links.
@alexpdev -- nice one!!
