I'm web scraping for the first time and I'm having trouble scraping a list of URLs from a website. It works fine on Colaboratory when I replace the specified path with /usr/lib/chromium-browser/chromedriver, but when I try this code in my IDE....

1 Answer

Just run Chrome in headed mode. In other words, don't enable headless mode.

from bs4 import BeautifulSoup
from selenium import webdriver

# Headed Chrome: no headless option is set here.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome(options=options)

courses = []
# range(1, 2) fetches only page 1; raise the upper bound to fetch more pages.
for i in range(1, 2):
    wd.get(f"https://www.sydney.edu.au/courses/search.html?search-type=course&page={i}")
    html_soup = BeautifulSoup(wd.page_source, "lxml")
    # Match result links by their full class string and collect each href.
    for x in html_soup.find_all("a", class_="b-result-container__item-wrapper b-result-container__item-wrapper--data b-link--no-underline"):
        courses.append(x.get("href"))

wd.quit()  # close the browser once scraping is done

for x in courses:
    print(x)

Output:

https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-commerce.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-economics.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-psychology0.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-pharmacy.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-music.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-science-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-arts-honours.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-advanced-computing.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-oral-health.html
https://www.sydney.edu.au/courses/courses/uc/bachelor-of-visual-arts.html

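You get this error because headless Chrome identifies itself as HeadlessChrome/89.0.4389.90 in its user agent string. You can see it in the console output, where Hotjar flags the user agent as suspicious: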

darkorange", source: https://www.sydney.edu.au/etc.clientlibs/courses/clientlibs/frontend-js.js (11714)
[0323/232203.250:INFO:CONSOLE(3)] "Hotjar not launching due to suspicious userAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36", source: https://static.hotjar.com/c/hotjar-550296.js?sv=6 (3)
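
If you do need headless mode, one thing you could try is overriding the user agent so it no longer contains the HeadlessChrome token. This is only a sketch and not something I tested against this site: the helper names below are made up for illustration, and it is an assumption that the site checks nothing beyond the user agent string. The sample string is the one from the log above.

```python
# User agent copied from the console log above.
HEADLESS_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/89.0.4389.90 Safari/537.36"
)


def is_headless_user_agent(user_agent: str) -> bool:
    """Return True if the user agent advertises headless Chrome."""
    return "HeadlessChrome/" in user_agent


def as_headed_user_agent(user_agent: str) -> str:
    """Swap the HeadlessChrome product token for the regular Chrome token."""
    return user_agent.replace("HeadlessChrome/", "Chrome/")


if __name__ == "__main__":
    spoofed = as_headed_user_agent(HEADLESS_UA)
    print(is_headless_user_agent(HEADLESS_UA))
    print(is_headless_user_agent(spoofed))
    print(spoofed)
    # Assumption: with Selenium you would pass this as a Chrome switch, e.g.
    # options.add_argument(f"--user-agent={spoofed}")
```

The string handling runs without a browser. The commented-out line shows where the rewritten value would go on the ChromeOptions object from the code above.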