
The site I want to scrape populates its results using JavaScript.

Can I somehow call that script directly and work with its results? (Then without pagination, of course.) I don't want to render the whole page just to scrape the formatted HTML, but the raw source is blank.

Have a look: http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0

The source of the returned page is simply

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
  <head>
    <SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>    
  </head>
    <whitebox>
    <div id = "hits"></div>  
  </whitebox>
</content>

I would prefer simple Python tools.

    I'm only just looking into this, but try PhantomJS and Selenium WebDriver. I'll try and get you an answer. Commented Mar 25, 2014 at 3:07

3 Answers


I downloaded Selenium and ChromeDriver.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')

# Each hit rendered by the JavaScript is an element with class "result";
# the anchor inside it carries the title and the target URL.
for e in driver.find_elements(By.CLASS_NAME, 'result'):
    link = e.find_element(By.TAG_NAME, 'a')
    print(link.text, link.get_attribute('href'))

driver.quit()

If you're using Chrome, you can inspect the page attributes using F12, which is pretty useful.
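Since `#hits` is filled in only after the page's JavaScript runs, the one fragile point in this approach is scraping before the results exist. A minimal sketch: the URL builder just reproduces the query string from the question, and the explicit-wait part is shown as comments because it assumes Selenium 4's `WebDriverWait` and a running browser:

```python
from urllib.parse import urlencode

BASE = 'http://kozbeszerzes.ceu.hu/searchresults.xhtml'

def search_url(query, page=0):
    # Rebuild the query string seen in the question: ?q=1998&page=0
    return '%s?%s' % (BASE, urlencode({'q': query, 'page': page}))

# With Selenium, an explicit wait makes sure the JS has populated #hits
# before find_elements runs (assumes Selenium 4):
#
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#   from selenium.webdriver.common.by import By
#
#   driver.get(search_url(1998))
#   WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.CLASS_NAME, 'result')))
```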




Indeed you can do that with Python. You need either python-ghost or Selenium. I prefer the latter combined with PhantomJS: it is much lighter, simpler to install, and easy to use.

Install phantomjs with npm (Node Package Manager):

apt-get install nodejs
npm install phantomjs

install selenium:

pip install selenium

and get the resulting page like this, then parse it with BeautifulSoup (or another library) as usual:

from bs4 import BeautifulSoup as bs
from selenium import webdriver

client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source, 'html.parser')
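Once the rendered source is in a soup, pulling the result links out is ordinary BeautifulSoup work. A minimal sketch, assuming the hits use the `result` class seen on the page; it is run here against a small inline HTML sample (a stand-in for `client.page_source`) so it works without a browser:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return (text, href) for each link inside a div with class 'result'."""
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.select('div.result a[href]')]

# Stand-in for client.page_source, for illustration only:
sample = '<div class="result"><a href="/tender/42">Tender 42</a></div>'
print(extract_links(sample))  # [('Tender 42', '/tender/42')]
```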



In a nutshell: you can't do this with Python alone.

As you've said, the page is populated by JavaScript (jQuery), which adds the content on the fly.

You can try running the script locally with Node.js and dumping the DOM as HTML at some point, but you will need to dig into the JS code anyway.
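As the asker notes in the comments, only one unpaginated query per year 1998-2014 is needed. Another route worth checking is whether `searchResultsView.js` fetches its `resp` from a plain HTTP endpoint (visible in the browser's network tab); if so, Python could call that endpoint directly. The sketch below only builds the per-year search URLs - the idea that the same `q`/`page` parameters work against the underlying data endpoint is an assumption:

```python
# The asker needs one unpaginated query per year, 1998-2014. Whether the JS's
# data endpoint accepts these same q/page parameters is an assumption; check
# the browser's network tab for the real request it makes.
def yearly_urls(first=1998, last=2014,
                base='http://kozbeszerzes.ceu.hu/searchresults.xhtml'):
    return ['%s?q=%d&page=0' % (base, year) for year in range(first, last + 1)]

urls = yearly_urls()
print(len(urls))  # 17 years, one request each
# Each URL could then be fetched with Selenium (or the raw endpoint, if found).
```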

3 Comments

Thanks. Can you then help me (or help rephrase the question) with how to run the right piece of JavaScript, e.g. with an AppleScript call ('tell application Google Chrome to execute ….js', but how exactly?). If you look at the .js file, I am happy with what it returns in 'resp'; without the pagination I only need to run it once for each year 1998-2014.
Node.js is a JS interpreter that you can install and use to run JS scripts. Just look at it; it's no harder to use than the Python shell/interpreter.
Will do. What I'm not sure about is how to specify the function arguments for this remote function, which is built to work with queries in the page that contained it. Thanks!
