How to scrape an embedded script on a webpage in Python

Question

For example, take the webpage http://www.amazon.com/dp/1597805483.

I want to use XPath to scrape this sentence: "Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime."

import requests
from lxml import html

url = "http://www.amazon.com/dp/1597805483"
page = requests.get(url)
tree = html.fromstring(page.text)
feature_bullets = tree.xpath('//*[@id="iframeContent"]/div/text()')
print(feature_bullets)

The code above returns nothing. The reason is that the XPath reported by the browser does not match the page source. But I don't know how to work out the XPath from the source code.

alecxe · Accepted Answer · 2014-10-31 18:19:58Z

There are a lot of things involved in building the page you are scraping.

As for the description specifically, the underlying HTML is constructed inside a JavaScript function:

<script type="text/javascript">

    P.when('DynamicIframe').execute(function (DynamicIframe) {
        var BookDescriptionIframe = null,
                bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3EOf%20all%20the%20sports%20played%20across%20the%20globe%2C%20none%20has%20more%20curses%20and%20superstitions%20than%20baseball%2C%20America%26%238217%3Bs%20national%20pastime.%3Cbr%3E%3CBR%3E%3CI%3EField%20of%20Fantasies%3C%2FI%3E%20delves%20right%20into%20that%20superstition%20with%20short%20stories%20written%20by%20several%20key%20authors%20about%20baseball%20and%20the%20supernatural.%20%20Here%20you%27ll%20encounter%20ghostly%20apparitions%20in%20the%20stands%2C%20a%20strangely%20charming%20vampire%20double-play%20combination%2C%20one%20fan%20who%20can%20call%20every%20shot%20and%20another%20who%20can%20see%20the%20past%2C%20a%20sad%20alternate-reality%20for%20the%20game%27s%20most%20famous%20player%2C%20unlikely%20appearances%20on%20the%20field%20by%20famous%20personalities%20from%20Stephen%20Crane%20to%20Fidel%20Castro%2C%20a%20hilariously%20humble%20teenage%20phenom%2C%20and%20much%20more.%20In%20this%20wonderful%20anthology%20are%20stories%20from%20such%20award-winning%20writers%20as%3A%3CBR%3E%3CBR%3EStephen%20King%20and%20Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3CBR%3EKaren%20Joy%20Fowler%3CBR%3ERod%20Serling%3CBR%3EW.%20P.%20Kinsella%3CBR%3EAnd%20many%20more%21%3CBR%3E%3CBR%3ENever%20has%20a%20book%20combined%20the%20incredible%20with%20great%20baseball%20fiction%20like%20%3CI%3EField%20of%20Fantasies%3C%2FI%3E.%20This%20wide-ranging%20collection%20reaches%20from%20some%20of%20the%20earliest%20classics%20from%20the%20pulp%20era%20and%20baseball%27s%20golden%20age%2C%20all%20the%20way%20to%20material%20appearing%20here%20for%20the%20first%20time%20in%20a%20print%20edition.%20Whether%20you%20love%20the%20game%20or%20just%20great%20fiction%2C%20these%20stories%20will%20appeal%20to%20all%2C%20as%20the%20writers%20in%20this%20anthology%20bring%20great%20storytelli
ng%20of%20the%20strange%20and%20supernatural%20to%20the%20plate%2C%20inning%20after%20inning.%3CBR%3E%3C%2Fdiv%3E",
                bookDescriptionAvailableHeight,
                minBookDescriptionInitialHeight = 112,
                options = {};
    ...

</script>

The idea here would be to get the script tag's text, extract the description value using regular expressions, unquote the HTML, parse it with lxml.html and get the .text_content():

import re
from urllib.parse import unquote  # Python 2: from urlparse import unquote

from lxml import html
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
tree = html.fromstring(page.content)

# Find the <script> element whose text mentions the variable we need.
script = tree.xpath('//script[contains(., "bookDescEncodedData")]')[0]
# Capture the percent-encoded string assigned to bookDescEncodedData.
match = re.search(r'bookDescEncodedData = "(.*?)",', script.text)
if match:
    # Decode the percent-encoding, then parse the resulting HTML fragment.
    description_html = html.fromstring(unquote(match.group(1)))
    print(description_html.text_content())

Prints:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime. 
Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural.  
Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. 
In this wonderful anthology are stories from such award-winning writers as:Stephen King and Stewart O’NanJack KerouacKaren Joy FowlerRod SerlingW. P. KinsellaAnd many more!Never has a book combined the incredible with great baseball fiction like Field of Fantasies. 
This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
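
The output above runs some lines together (for example "writers as:Stephen King") because the `<BR>` tags in the description contribute no text of their own. As a rough sketch of one way around that, the snippet below repeats the regex and unquote steps on a small sample and turns each `<br>` into a newline. It uses only the standard library so it can be tried offline; `SAMPLE_SCRIPT`, `TextWithBreaks` and `extract_description` are hypothetical names introduced for illustration, and the sample string is a shortened stand-in for the real script body, not the actual page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Hypothetical, shortened stand-in for the <script> body shown earlier.
SAMPLE_SCRIPT = (
    'var BookDescriptionIframe = null,\n'
    '    bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Title%3C%2FB%3E%3CBR%3E%3CBR%3E'
    'America%26%238217%3Bs%20national%20pastime.%3Cbr%3EJack%20Kerouac%3C%2Fdiv%3E",\n'
    '    bookDescriptionAvailableHeight,'
)


class TextWithBreaks(HTMLParser):
    """Collect text nodes, emitting a newline for every <br> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":  # HTMLParser lowercases tag names
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract_description(script_text):
    """Return the decoded description text, or None if the variable is absent."""
    match = re.search(r'bookDescEncodedData = "(.*?)",', script_text)
    if match is None:
        return None
    parser = TextWithBreaks()
    parser.feed(unquote(match.group(1)))  # percent-decoding yields an HTML fragment
    parser.close()
    return "".join(parser.parts)


print(extract_description(SAMPLE_SCRIPT))
```

With lxml or BeautifulSoup the same effect could probably be had by replacing `<br>` elements with newlines before extracting the text; the standard-library parser is used here only to keep the sketch dependency-free.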

Similar solution, but using BeautifulSoup:

import re
from urllib.parse import unquote  # Python 2: from urlparse import unquote

from bs4 import BeautifulSoup
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
soup = BeautifulSoup(page.content, 'html.parser')

# Match the <script> whose text mentions the variable; skip scripts with no text.
script = soup.find('script', string=lambda x: x and 'bookDescEncodedData' in x)
match = re.search(r'bookDescEncodedData = "(.*?)",', script.string)
if match:
    description_html = BeautifulSoup(unquote(match.group(1)), 'html.parser')
    print(description_html.text)

Alternatively, you can take a high-level approach and use a real browser with the help of selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://rads.stackoverflow.com/amzn/click/1597805483"

driver = webdriver.Firefox()
driver.get(url)

# The description is rendered inside an iframe, so switch into it first.
iframe = driver.find_element(By.ID, 'bookDesc_iframe')
driver.switch_to.frame(iframe)

print(driver.find_element(By.ID, 'iframeContent').text)

driver.quit()

This produces much more nicely formatted output:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime

Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.

Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. In this wonderful anthology are stories from such award-winning writers as:

Stephen King and Stewart O’Nan
Jack Kerouac
Karen Joy Fowler
Rod Serling
W. P. Kinsella
And many more!

Never has a book combined the incredible with great baseball fiction like Field of Fantasies. This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.

@so3 Chrome developer tools and brain developer tools :) The XPath is pretty simple, as you can see: I just check for the text inside the script tag.
But Chrome developer tools don't give you the XPath for the original source code.
@so3 Well, I did some research looking for parts of the description and found that they are hidden inside that script tag. Finding that was the critical step.
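
The kind of search described in the last comment can be approximated in code. The sketch below is only an illustration, not the answerer's actual procedure: it looks for a known phrase inside each `<script>` body of the raw page source, both as plain text and in percent-encoded form. `find_scripts_containing` and `SAMPLE_PAGE` are hypothetical names, the regex-based script extraction is a crude shortcut, and a real page may encode the text in some other way.

```python
import re
from urllib.parse import quote


def find_scripts_containing(page_source, phrase):
    """Return (script_index, form) pairs for every <script> body holding the phrase."""
    # Forms the phrase may take in the source: as-is, or percent-encoded.
    variants = {"plain": phrase, "percent-encoded": quote(phrase, safe="")}
    # Crude extraction of script bodies; a real HTML parser is more robust.
    scripts = re.findall(
        r"<script\b[^>]*>(.*?)</script>",
        page_source,
        flags=re.DOTALL | re.IGNORECASE,
    )
    hits = []
    for index, body in enumerate(scripts):
        for label, needle in variants.items():
            if needle in body:
                hits.append((index, label))
    return hits


# Hypothetical miniature page standing in for the downloaded source.
SAMPLE_PAGE = (
    "<html><body>"
    "<script>var a = 1;</script>"
    '<script type="text/javascript">var d = "Of%20all%20the%20sports%20played";</script>'
    "</body></html>"
)

print(find_scripts_containing(SAMPLE_PAGE, "all the sports"))
```

If the phrase is only found in percent-encoded form, that suggests the visible text is assembled by JavaScript at load time, which is why an XPath copied from the browser finds nothing in the raw source.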
