I'm trying to scrape a single page using Scrapy and Selenium:

import time
import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(30)
        for page in response.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }
        self.driver.quit()

The spider doesn't capture the known tags and outputs:

{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl01\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl02\", \"\", true, \"\", \"\", false, true))"]},
{"url-text": [" "], "url-href": ["javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(\"ctl00$PlaceHolderMain$ctl01$ctl00$ctl03\", \"\", true, \"\", \"\", false, true))"]}

Thoughts?

1 Answer

You are reading the response from Scrapy while driving the page in Selenium, and that won't work: the Scrapy response still holds the HTML that Scrapy downloaded, not what the browser rendered. You need to take the page source from Selenium and create a Scrapy response object from it.

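import scrapy
from selenium import webdriver

class SampleSpider(scrapy.Spider):
    name = "sample"
    start_urls = ['url-to-scrape']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before starting the browser.
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Load the same URL in the browser so that its JavaScript runs.
        self.driver.get(response.url)
        # Swap the rendered HTML into the response so Scrapy selectors see it.
        res = response.replace(body=self.driver.page_source)

        for page in res.css('a'):
            yield {
                'url-href': page.xpath('@href').extract(),
                'url-text': page.css('::text').extract()
            }

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; shut the browser down here.
        self.driver.quit()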

Also, time.sleep is not needed in this case, because by default driver.get() returns only after the page has finished loading.
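If a page keeps adding elements after its initial load, a fixed sleep can be swapped for polling until the content shows up. Selenium ships its own explicit wait for this (WebDriverWait); the sketch below is only a library-free stand-in for the same idea, and the name wait_until and its parameters are made up for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError once `timeout`
    seconds have passed. Hypothetical helper, not part of Scrapy or Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Possible use inside parse(), assuming Selenium 4 and at least one <a>
# that appears only after scripts have run:
#
#     from selenium.webdriver.common.by import By
#     wait_until(lambda: self.driver.find_elements(By.CSS_SELECTOR, "a"))
```

Unlike time.sleep(30), this returns as soon as the condition holds and fails loudly if it never does.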

2 Comments

Wouldn't self.driver.get(response.url) have to be something like self.driver.get(response.body) instead? Doesn't response.url simply return the URL? I'm trying to figure out how to use Scrapy and Selenium side by side, but it is almost impossible to find information online about how to do this.
@Anthony, response.body gives you the plain HTML response, with the JavaScript not executed. That is why we use response.url and open the same URL in the Selenium driver. Then we use self.driver.page_source to get the rendered HTML with the JavaScript executed.
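
To make the difference between the two sources concrete, here is a minimal sketch that runs the same link extraction over two HTML strings. It uses only the standard library rather than Scrapy selectors, and both strings are invented stand-ins: RAW_HTML for what response.body might contain, RENDERED_HTML for what driver.page_source might return after scripts have run.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False  # True while inside an open <a> element
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


def extract_links(html):
    """Return the list of (href, text) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical stand-in for response.body: the markup as the server sent it.
RAW_HTML = '<div id="results"></div><script>/* fills #results later */</script>'

# Hypothetical stand-in for driver.page_source: the DOM after scripts ran.
RENDERED_HTML = '<div id="results"><a href="/item/1">Item 1</a></div>'

print(extract_links(RAW_HTML))
print(extract_links(RENDERED_HTML))
```

The extraction logic is identical in both calls; only the HTML it is given changes, which is why the answer replaces the response body instead of changing the selectors.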
