
Recently I've been learning web scraping with Python and Beautiful Soup. However, I've hit a bit of a bump when trying to scrape the following page:

http://www.librarything.com/work/3203347

The data I want from the page is the tags for the book, but I can't find any way to get at them despite spending a lot of time trawling the internet.

I tried following a few guides online, but none of them seemed to work. I also tried converting the page to XML and JSON, but I still couldn't find the data.

Pretty stumped at the moment and I'd appreciate any help.

Thanks.

Do you mean scraping a specific element on the page, i.e. the data under the Tags header? Commented Nov 24, 2017 at 15:03

3 Answers


After analyzing the HTML and the page's scripts, it turns out the tags are loaded through an AJAX call, so requesting the AJAX URL directly makes our life easy. Here is the Python script:

import requests
from bs4 import BeautifulSoup

# The tag cloud is fetched from this AJAX endpoint; the 'check' value comes
# from the request the page itself makes when it loads.
content = requests.get("http://www.librarything.com/ajax_work_makeworkCloud.php?work=3203347&check=2801929225").text
soup = BeautifulSoup(content, "html.parser")  # pass a parser explicitly to avoid a warning

# Each tag in the cloud is rendered as an <a> element.
for tag in soup.find_all('a'):
    print(tag)
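
If you only want the tag names rather than the full <a> elements, a small follow-up (assuming the AJAX response is the plain HTML shown above) would be:

tag_names = [a.get_text(strip=True) for a in soup.find_all('a')]
print(tag_names)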

1 Comment

Oh my god. After literally hours and hours of searching through the internet, I find something that works. Thank you so much!!

I'm not sure which data you want to scrape from the page, but when I checked, the "Tags" section is loaded dynamically by JavaScript that runs once the page has loaded. If your scraper only fetches the DOM and parses the page in the background without rendering it in a browser, it's very likely that none of the dynamic data on the page will load.

One possible solution is to use Selenium to load the page completely and then scrape it.


A possible implementation without BeautifulSoup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

my_url = 'http://www.librarything.com/work/3203347'
driver = webdriver.Chrome()
driver.get(my_url)

delay = 5  # seconds to wait for the tags to appear

try:
    # Wait until the page's JavaScript has added at least one tag element to the DOM
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'span.tag')))
    print("Page is ready!")
    # find_elements(By.CSS_SELECTOR, ...) is the current form of the old find_elements_by_css_selector
    for element in driver.find_elements(By.CSS_SELECTOR, 'span.tag'):
        print(element.text)
except TimeoutException:
    print("Couldn't load page")
finally:
    driver.quit()
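
If you would rather not have a browser window open while this runs, Chrome can also be started headless. A minimal sketch (same URL and selector assumed as above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window; the rest of the script stays the same.
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)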

Sources for the implementation:

Waiting until an element identified by its css is present

Locating elements with selenium
