Scrapy & Selenium

Question

I'm trying to scrape a single page using Scrapy and Selenium:

import time
import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(30)
        for page in response.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        self.driver.quit()

The spider doesn't capture the known tags and outputs:

{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl01\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl02\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl03\", \"\", true, \"\", \"\", false, true))"]}

Thoughts?

Ray · Accepted Answer · 2017-09-30 16:43:22Z

3

You are running your selectors on the response from Scrapy, not on the page loaded in Selenium, so this won't work. You need to take the page source from Selenium and create a Scrapy response object from it.

import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

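    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before starting the browser
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Load the same URL in the browser so the page's JavaScript runs
        self.driver.get(response.url)
        # Swap the rendered HTML into the Scrapy response so selectors see it
        res = response.replace(body=self.driver.page_source)

        for page in res.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        # Fine for a single start URL; the browser is closed after the first page
        self.driver.quit()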

Also, time.sleep is not needed in this case.

edited Sep 30, 2017 at 16:43

Ray

answered Sep 30, 2017 at 6:03

Tarun Lalwani

2 Comments

oldboy Over a year ago

wouldn't self.driver.get(response.url) have to be something like self.driver.get(response.body) instead?! doesn't response.url simply return the URL? i'm trying to figure out how to use scrapy and selenium side-by-side, but it is almost impossible to find info anywhere online regarding how to do this

Tarun Lalwani Over a year ago

@Anthony, response.body gives you the plain HTML response, with the JS not executed. That is why we use response.url and browse to the same URL in the Selenium driver. Then we use self.driver.page_source to get the rendered HTML with the JavaScript executed.
