Integrating Selenium with Scrapy

Question

Is there any way to effectively integrate Selenium into Scrapy for its page-rendering capabilities (in order to generate screenshots)?

Many solutions I've seen simply pass the URL from a Scrapy request/response to WebDriver after Scrapy has already processed the request, and then work from WebDriver's copy of the page. This doubles the number of requests, fails in many cases (sites requiring logins, sites with dynamic or pseudo-random content, etc.), and invalidates many extensions and middlewares.

Is there any "good" way of getting the two to work together? Is there a better way for generating screenshots of the content I'm scraping?

Community · Accepted Answer · 2017-05-23 10:31:22Z

6

Use Scrapy's Downloader Middleware. See my answer on another question for a simple example: https://stackoverflow.com/a/31186730/639806
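A minimal sketch of what such a downloader middleware might look like, assuming Selenium's Firefox driver. The class name SeleniumScreenshotMiddleware, the screenshot_path helper, and the "screenshots" directory are illustrative assumptions and are not taken from the linked answer. The idea is that returning a response from process_request makes Scrapy skip its own download, so each page is fetched only once, by the browser.

```python
import hashlib
import os


def screenshot_path(url, directory="screenshots"):
    """Build a stable file name for a URL (hypothetical naming scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(directory, digest + ".png")


class SeleniumScreenshotMiddleware:
    """Downloader middleware that lets a browser fetch the page instead of Scrapy."""

    def __init__(self, driver, directory="screenshots"):
        self.driver = driver
        self.directory = directory

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the module still loads where no browser is installed.
        from scrapy import signals
        from selenium import webdriver

        middleware = cls(webdriver.Firefox())
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Imported lazily for the same reason as above.
        from scrapy.http import HtmlResponse

        self.driver.get(request.url)
        os.makedirs(self.directory, exist_ok=True)
        self.driver.save_screenshot(screenshot_path(request.url, self.directory))
        # Returning a Response short-circuits Scrapy's own download of this URL.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        # Shut the browser down when the crawl ends.
        self.driver.quit()
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in the project settings. As written, this sketch renders every request and ignores the request's method, headers, cookies and proxy settings.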

edited May 23, 2017 at 10:31

answered Jul 14, 2015 at 13:58

JoeLinux

5 Comments

Rejected Over a year ago

I've looked at this, and while it does fix one of the issues (doubling up on requests), it bypasses many features Scrapy provides. It discards user-agent configuration, proxy configuration, and headers, and offers zero persistence between calls (no sessions/cookies). Furthermore, it's impossible to submit POST requests in Selenium, so things like FormRequests will break or have very unexpected results.

JoeLinux Over a year ago

It does bypass those things. It's a very simple example, but a lot of those things can be duplicated in Selenium (such as cookies, headers and user-agent string). In fact, most of that info you can pull using the request information that's available as an arg to the process_request method. Also, you won't need to POST through Selenium. No reason you can't do that through Scrapy in parse after pulling the Selenium response.
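As a rough illustration of pulling that information from the request passed to process_request, the helpers below copy cookies and read the user-agent string. The function names are hypothetical. The sketch assumes the middleware runs after whatever sets the User-Agent header, and that request.cookies holds only cookies passed explicitly to the request (not the session jar kept by Scrapy's cookies middleware). Selenium's add_cookie only accepts cookies for the domain the browser is currently on, and the user agent generally has to be configured when the browser is launched, so user_agent_of is only useful at driver creation time.

```python
def selenium_cookies(request_cookies):
    """Normalize Scrapy-style cookies (a dict or a list of dicts) for add_cookie()."""
    if isinstance(request_cookies, dict):
        return [{"name": k, "value": str(v)} for k, v in request_cookies.items()]
    cookies = []
    for c in request_cookies or []:
        # Keep only the keys this sketch cares about.
        cookie = {k: c[k] for k in ("name", "value", "domain", "path") if k in c}
        cookie["value"] = str(cookie["value"])
        cookies.append(cookie)
    return cookies


def user_agent_of(request, default=None):
    """Read the User-Agent header; Scrapy stores header values as bytes."""
    value = request.headers.get("User-Agent")
    if value is None:
        return default
    return value.decode("utf-8") if isinstance(value, bytes) else value


def apply_request_cookies(driver, request):
    """Copy the request's cookies onto a driver already on the target domain."""
    for cookie in selenium_cookies(request.cookies):
        driver.add_cookie(cookie)
```

Inside process_request, apply_request_cookies(self.driver, request) would typically be called after a first driver.get() on the target domain, followed by a reload so the cookies take effect.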

Rejected Over a year ago

Wouldn't the FormRequest be "hijacked" by the Selenium Downloader Middleware as it passed through, and then processed as a driver.get(url) by Selenium? How could this be prevented?

JoeLinux Over a year ago

Use a conditional (e.g., if should_process_js(request):), and just return None to let Scrapy continue processing the request normally if whatever conditions you choose are false (such as the request being a POST, or whatever you decide).
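One possible shape for that conditional is sketched below. The policy inside should_process_js (GET only, opt-in through a hypothetical "render_js" meta key) and the injected render callable are assumptions made for illustration. Under Scrapy's downloader middleware contract, returning None from process_request lets the request continue through the remaining middlewares and the normal downloader, whereas returning a Request object reschedules it.

```python
def should_process_js(request):
    """Hypothetical policy: render only GET requests that opt in via meta."""
    return request.method == "GET" and bool(request.meta.get("render_js"))


class ConditionalSeleniumMiddleware:
    """Hands opted-in requests to a browser and leaves the rest to Scrapy."""

    def __init__(self, render):
        # `render` is any callable that takes a request and returns a Response,
        # for example a function wrapping driver.get() and driver.page_source.
        self.render = render

    def process_request(self, request, spider):
        if not should_process_js(request):
            # None means "not handled here": Scrapy downloads it as usual,
            # so a POST from a FormRequest is never turned into driver.get().
            return None
        return self.render(request)
```

A spider would then opt in per request, for example by passing meta={"render_js": True} when creating it.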

Rejected Over a year ago

I've worked on this and found other issues that I was curious whether you had any thoughts on. Returning an HtmlResponse doesn't fire the response_downloaded signal, so anything relying on it (such as throttling) breaks. Custom headers, most importantly "Referer", cannot be manually set on WebDriver.
