
Recently I've been learning web scraping with Python and Beautiful Soup. However, I've hit a bit of a bump when trying to scrape the following page:

http://www.librarything.com/work/3203347

The data I want from the page is the tags for the book, but I can't find any way to get at them despite spending a lot of time trawling the internet.

I tried following a few guides online, but none of them seemed to work. I also tried converting the page to XML and JSON, but I still couldn't find the data.

Pretty stumped at the moment and I'd appreciate any help.

Thanks.

Do you mean scraping a specific element on the page, i.e. the data under the Tags header? Commented Nov 24, 2017 at 15:03

3 Answers


After analyzing the HTML and the page's scripts, it turns out the tags are loaded through an AJAX call, so requesting the AJAX URL directly makes our life easy. Here is the Python script:

import requests
from bs4 import BeautifulSoup

# The tag cloud is fetched from this AJAX endpoint; the 'check' value comes
# from the request the page itself makes when it loads.
content = requests.get("http://www.librarything.com/ajax_work_makeworkCloud.php?work=3203347&check=2801929225").text
soup = BeautifulSoup(content, "html.parser")  # pass a parser explicitly to avoid a warning

# Each tag in the cloud is rendered as an <a> element.
for tag in soup.find_all('a'):
    print(tag)
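
If you only want the tag names rather than the full <a> elements, a small follow-up (assuming the AJAX response is the plain HTML shown above) would be:

tag_names = [a.get_text(strip=True) for a in soup.find_all('a')]
print(tag_names)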

1 Comment

Oh my god. After literally hours and hours of searching through the internet, I find something that works. Thank you so much!!

I'm not sure which data you want to scrape from the page, but when I checked, the "Tags" section is loaded dynamically by JavaScript that runs once the page has loaded. If your scraper only fetches the DOM and parses the page in the background without rendering it in a browser, it's very likely that none of the dynamic data on the page will load.

One possible solution is to use Selenium to load the page completely and then scrape it.


A possible implementation without BeautifulSoup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

my_url = 'http://www.librarything.com/work/3203347'
driver = webdriver.Chrome()
driver.get(my_url)

delay = 5  # seconds to wait for the tags to appear

try:
    # Wait until the page's JavaScript has added at least one tag element to the DOM
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'span.tag')))
    print("Page is ready!")
    # find_elements(By.CSS_SELECTOR, ...) is the current form of the old find_elements_by_css_selector
    for element in driver.find_elements(By.CSS_SELECTOR, 'span.tag'):
        print(element.text)
except TimeoutException:
    print("Couldn't load page")
finally:
    driver.quit()
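
If you would rather not have a browser window open while this runs, Chrome can also be started headless. A minimal sketch (same URL and selector assumed as above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window; the rest of the script stays the same.
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)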

Sources for the implementation:

Waiting until an element identified by its css is present

Locating elements with selenium
