14

I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need are the img src and alt attributes. I've found this code:

page.onResourceRequested = function(requestData, request) {
    if ((/https?:\/\/.+?\.css/i).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

via: How can I control PhantomJS to skip download some kind of resource?

How and where can I implement this code in Selenium driven by Python? Or is there a better way to stop CSS and other resources from downloading?

Note: I've already found how to prevent image download by editing service_args variable via:

How do I set a proxy for phantomjs/ghostdriver in python webdriver?

and

PhantomJS 1.8 with Selenium on python. How to block images?
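
For reference, the image-blocking approach from those questions can be sketched roughly as follows. The helper names build_service_args and make_driver are hypothetical, made up for illustration; --load-images is a PhantomJS command-line option, and the sketch assumes an older Selenium release that still ships webdriver.PhantomJS:

```python
def build_service_args(load_images=False, extra=()):
    """Return PhantomJS command-line flags as a list of strings."""
    flag = "--load-images=" + ("true" if load_images else "false")
    return [flag] + list(extra)


def make_driver(service_args):
    """Start PhantomJS with the given flags (hypothetical helper)."""
    # Imported lazily so this module loads without Selenium installed.
    from selenium import webdriver
    # Assumption: a Selenium release that still includes PhantomJS support.
    return webdriver.PhantomJS(service_args=service_args)


args = build_service_args(load_images=False)
print(args)  # ['--load-images=false']
# driver = make_driver(args)  # requires the phantomjs binary on PATH
```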

But service_args can't help me with resources like CSS. Thanks!

3
  • If all you want is the HTML and select elements from the page, is Selenium/PhantomJS the best option? Have you considered using python-requests? Commented Oct 10, 2013 at 13:43
  • @brechin, that's a great idea, thanks! Unfortunately I don't think python-requests can get javascript injected content. For example, see the main image on this page: everlane.com/collections/mens-luxury-tees/products/…. Everything in <div id="content" class="clearfix"> is injected via backbone.js, and in my output from python-requests, I simply get an empty div with the <!-- Filled in by Chaplin --> comment... Might I be missing something? Commented Oct 14, 2013 at 21:59
  • I'd look at the requests and just grab everlane.com/api/collections Commented Oct 22, 2013 at 23:11

3 Answers

7

A bold young soul by the name of “watsonmw” recently added functionality to Ghostdriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls that require a page object, such as the onResourceRequested callback you cited.

For a solution at all costs, consider building from source (which developers note “takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine”) and integrating his patch, linked above.

Then this (untested) Python code should work as a proof of concept:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

Without the patch, you’ll just get a "Can't find variable: page" exception.

Good luck! There are a lot of great alternatives, like working in a Javascript environment, driving Gecko, proxies, etc.


1 Comment

It seems that the patch is already in Ghostdriver 1.1.0, but when I start it (with phantomjs /path/to/ghostdriver/1.1.0/src/main.js) and connect to it (with driver = webdriver.PhantomJS(port=8910) ) I still get Can't find variable: page.
4

Will's answer got me on track. (Thanks Will!)

Current PhantomJS (1.9.8) includes Ghostdriver 1.1.0 which already contains watsonmw's patch.

Download the latest PhantomJS, then create the following symlinks (sudo may be required):

ln -s path/to/bin/phantomjs  /usr/local/share/phantomjs
ln -s path/to/bin/phantomjs  /usr/local/bin/phantomjs
ln -s path/to/bin/phantomjs  /usr/bin/phantomjs

And then try this:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    var page = this; // won't work otherwise
    page.onResourceRequested = function(requestData, request) {
        // ...
    };
''', 'args': []})
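
One way the // ... placeholder might be filled in is sketched below. It is untested against a live PhantomJS and rests on a few assumptions: that `this` is the current page inside executePhantomScript (as noted above), and that matching on the URL's file extension is good enough for your target site. The helper names build_block_script and install_resource_blocker are hypothetical:

```python
import json
import re


def build_block_script(extensions=("css",)):
    """Return PhantomJS-side JavaScript that aborts requests by file extension."""
    alternation = "|".join(re.escape(ext) for ext in extensions)
    # Matches e.g. ".css" at the end of the URL, optionally followed by a query string.
    pattern = r"\.(?:" + alternation + r")(?:\?.*)?$"
    # json.dumps turns the pattern into a safely escaped JavaScript string literal.
    return """
    var page = this;  // assumption: `this` is the current page
    var blocked = new RegExp(%s, 'i');
    page.onResourceRequested = function(requestData, request) {
        if (blocked.test(requestData.url)) {
            request.abort();
        }
    };
    """ % json.dumps(pattern)


def install_resource_blocker(driver, extensions=("css",)):
    """Register the Ghostdriver command and send the blocking script."""
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    return driver.execute(
        'executePhantomScript',
        {'script': build_block_script(extensions), 'args': []})
```

With a real driver, the call would be install_resource_blocker(driver, ("css", "woff")) before driver.get(...). Building the regular expression once outside the callback avoids recompiling it for every request.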


2

The proposed solutions didn't work for me, but this one does (it uses driver.execute_script):

driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute_script('''
    this.onResourceRequested = function(request, net) {
        console.log('REQUEST ' + request.url);
    };
''')

