
The site I want to scrape populates its results using JavaScript.

Can I somehow call that script directly and work with its results? (Then without pagination, of course.) I don't want to render the whole page just to scrape the formatted HTML, but the raw source is blank.

Have a look: http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0

The source of the returned page is simply

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
  <head>
    <SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>    
  </head>
    <whitebox>
    <div id = "hits"></div>  
  </whitebox>
</content>

I would prefer simple Python tools.

    I'm only just looking into this, but try PhantomJS and Selenium WebDriver. I'll try and get you an answer. Commented Mar 25, 2014 at 3:07

3 Answers


I downloaded Selenium and ChromeDriver.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')

# Each hit rendered by the JavaScript is an element with class "result";
# the anchor inside it carries the title and the target URL.
for e in driver.find_elements(By.CLASS_NAME, 'result'):
    link = e.find_element(By.TAG_NAME, 'a')
    print(link.text, link.get_attribute('href'))

driver.quit()

If you're using Chrome, you can inspect the page attributes using F12, which is pretty useful.
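Since `#hits` is filled in only after the page's JavaScript runs, the one fragile point in this approach is scraping before the results exist. A minimal sketch: the URL builder just reproduces the query string from the question, and the explicit-wait part is shown as comments because it assumes Selenium 4's `WebDriverWait` and a running browser:

```python
from urllib.parse import urlencode

BASE = 'http://kozbeszerzes.ceu.hu/searchresults.xhtml'

def search_url(query, page=0):
    # Rebuild the query string seen in the question: ?q=1998&page=0
    return '%s?%s' % (BASE, urlencode({'q': query, 'page': page}))

# With Selenium, an explicit wait makes sure the JS has populated #hits
# before find_elements runs (assumes Selenium 4):
#
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#   from selenium.webdriver.common.by import By
#
#   driver.get(search_url(1998))
#   WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.CLASS_NAME, 'result')))
```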




Indeed you can do that with Python. You need either python-ghost or Selenium. I prefer the latter combined with PhantomJS: it is much lighter, simpler to install, and easy to use.

Install phantomjs with npm (Node Package Manager):

apt-get install nodejs
npm install phantomjs

install selenium:

pip install selenium

and get the resulting page like this, then parse it with BeautifulSoup (or another library) as usual:

from bs4 import BeautifulSoup as bs
from selenium import webdriver

client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source, 'html.parser')
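Once the rendered source is in a soup, pulling the result links out is ordinary BeautifulSoup work. A minimal sketch, assuming the hits use the `result` class seen on the page; it is run here against a small inline HTML sample (a stand-in for `client.page_source`) so it works without a browser:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return (text, href) for each link inside a div with class 'result'."""
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.get_text(strip=True), a['href'])
            for a in soup.select('div.result a[href]')]

# Stand-in for client.page_source, for illustration only:
sample = '<div class="result"><a href="/tender/42">Tender 42</a></div>'
print(extract_links(sample))  # [('Tender 42', '/tender/42')]
```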



In a nutshell: you can't do this with Python alone.

As you've said, the page is populated by JavaScript (jQuery), which adds the content on the fly.

You can try running the script locally with Node.js and dumping the DOM as HTML at some point, but you will need to dig into the JS code anyway.
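As the asker notes in the comments, only one unpaginated query per year 1998-2014 is needed. Another route worth checking is whether `searchResultsView.js` fetches its `resp` from a plain HTTP endpoint (visible in the browser's network tab); if so, Python could call that endpoint directly. The sketch below only builds the per-year search URLs - the idea that the same `q`/`page` parameters work against the underlying data endpoint is an assumption:

```python
# The asker needs one unpaginated query per year, 1998-2014. Whether the JS's
# data endpoint accepts these same q/page parameters is an assumption; check
# the browser's network tab for the real request it makes.
def yearly_urls(first=1998, last=2014,
                base='http://kozbeszerzes.ceu.hu/searchresults.xhtml'):
    return ['%s?q=%d&page=0' % (base, year) for year in range(first, last + 1)]

urls = yearly_urls()
print(len(urls))  # 17 years, one request each
# Each URL could then be fetched with Selenium (or the raw endpoint, if found).
```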

3 Comments

Thanks. Can you then help me (or help rephrase the question) with how to run the right piece of JavaScript, e.g. with an AppleScript call ('tell application Google Chrome to execute ….js', but how exactly?). If you look at the .js file, I am happy with what it returns in 'resp'; without the pagination I only need to run it once for each year 1998-2014.
Node.js is a JS interpreter that you can install and use to run JS scripts. Just look at it; it's no harder to use than the Python shell/interpreter.
Will do. What I'm not sure about is how to specify the function arguments for this remote function, which is built to work with queries in the page that contained it. Thanks!
