I'm trying to crawl bloomberg.com and find links to all English news articles. The problem with the code below is that it does find a lot of articles from the first page, but then it gets stuck in a loop and only returns something once in a while.
from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

visited = set()
to_crawl = deque()
to_crawl.append("https://www.bloomberg.com")

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    for elem in elems:
        # retrieve each href link and save it to the url_element variable
        url_element = elem.get_attribute("href")
        if url_element not in visited:
            to_crawl.append(url_element)
            visited.add(url_element)
            # save news articles
            if 'www.bloomberg.com/news/articles' in url_element:
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    browser.close()

while len(to_crawl):
    url_to_crawl = to_crawl.pop()
    crawl_link(url_to_crawl)
I've tried using a queue and then a stack, but the behavior is the same. I can't seem to accomplish what I'm looking for.
How do you crawl websites like this to collect news URLs?
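For what it's worth, one thing I suspect is that I'm queuing every link on the page, including off-site ones, so the crawl wanders away from bloomberg.com. A sketch of the kind of filter I'm considering adding before `to_crawl.append(url_element)` (the helper name and the exact rule are my own, not part of the code above):

```python
from urllib.parse import urlparse

def should_queue(url):
    """Return True only for links that stay on www.bloomberg.com.

    Hypothetical helper: the idea is to keep the crawl on-site by
    comparing the link's host against the one we started from.
    """
    if not url:
        return False
    return urlparse(url).netloc == "www.bloomberg.com"

print(should_queue("https://www.bloomberg.com/news/articles/abc"))  # True
print(should_queue("https://twitter.com/business"))                 # False
```

I haven't confirmed this is the actual cause of the loop, though, so I'd appreciate pointers on the right approach.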