
How do I scrape data with the Scrapy framework from websites that load data using JavaScript frameworks? Scrapy downloads the HTML for each page request, but some websites use JS frameworks like Angular or Vue.js that load data separately.

I have tried using a combination of Scrapy, Selenium, and ChromeDriver to retrieve the rendered HTML with its content. But with this method I am not able to retain the session cookies set for selecting country and currency, as each request is handled by a separate instance of Selenium or Chrome.

Please suggest whether there are any alternative options for scraping the dynamic content while retaining the session.

Here is the code I used to set the country and currency:

import scrapy
from selenium import webdriver

class SettingSpider(scrapy.Spider):
    name = 'setting'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One driver instance, created once for the spider.
        self.driver = webdriver.Chrome()

    def start_requests(self):
        url = 'http://www.example.com/intl/settings'
        # Load the settings page in the browser, then hand the same URL to Scrapy.
        self.driver.get(url)  # was response.url, which is undefined here
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Extract the CSRF token required to submit the settings form.
        csrf = response.xpath('//input[@name="CSRFToken"]/@value').extract_first().strip()
        print('------------->' + csrf)
        url = 'http://www.example.com/intl/settings'

        form_data = {'shippingCountry': 'ARE', 'language': 'en',
                     'billingCurrency': 'USD', 'indicativeCurrency': '',
                     'CSRFToken': csrf}  # key was 'CSRFToken:' with a stray colon
        yield scrapy.FormRequest(url, formdata=form_data, callback=self.after_post)

    def after_post(self, response):
        # Placeholder callback; the original snippet referenced it without defining it.
        self.logger.info('Settings posted: %s', response.status)
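One way to keep the country/currency session after the form post is to reuse the single driver's cookies in subsequent Scrapy requests. Selenium's `driver.get_cookies()` returns a list of dicts, while Scrapy's `Request(cookies=...)` takes a flat mapping, so a small converter bridges them. A minimal sketch; the helper name and the usage comment are assumptions, not part of the original code:

```python
# Convert Selenium's cookie list into the dict Scrapy's Request accepts,
# so the session set in the browser carries over to Scrapy requests.
def selenium_cookies_to_dict(selenium_cookies):
    """Turn driver.get_cookies() output into a {name: value} dict."""
    return {c["name"]: c["value"] for c in selenium_cookies}

# Hypothetical usage inside the spider:
#   cookies = selenium_cookies_to_dict(self.driver.get_cookies())
#   yield scrapy.Request(url, cookies=cookies, callback=self.parse)
```

With the cookies forwarded explicitly, Scrapy's own cookie middleware keeps them for the rest of the crawl, so the browser only needs to be involved once.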

1 Answer


What you said,

as each request is handled by a separate instance of selenium or chrome

is not correct.

You can continue to use Selenium, and I suggest you use PhantomJS instead of Chrome. I can't help more because you didn't post your code.

One example with PhantomJS:

from selenium import webdriver

driver = webdriver.PhantomJS()  # headless browser, no visible window
driver.set_window_size(1120, 800)
driver.get("https://example.com/")
driver.close()
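Whichever browser you pick, the session only survives if the same driver instance serves every page load. A sketch of that idea; the class and its `driver_factory` hook are hypothetical, chosen so any webdriver (Chrome, PhantomJS, ...) can be plugged in:

```python
# Keep ONE browser instance alive for the whole crawl so that session
# cookies (e.g. country/currency settings) persist across page loads.
class PersistentBrowser:
    def __init__(self, driver_factory):
        # driver_factory is a zero-argument callable returning a webdriver,
        # e.g. webdriver.PhantomJS or webdriver.Chrome.
        self._factory = driver_factory
        self._driver = None

    @property
    def driver(self):
        # Create the browser lazily, exactly once; reusing it keeps cookies.
        if self._driver is None:
            self._driver = self._factory()
        return self._driver

    def get_html(self, url):
        """Load a page in the shared browser and return its rendered HTML."""
        self.driver.get(url)
        return self.driver.page_source
```

Creating the driver in the spider's `__init__` (as in the question's code) and routing every page load through it achieves the same thing; the pitfall is only spawning a fresh driver per request.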

And if you don't want to use Selenium, you can use Splash:

Splash is a javascript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python 3 using Twisted and QT5

as @Granitosaurus said in this question

Bonus points for it being developed by the same guys who are developing Scrapy.
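Since Splash exposes its renderer over plain HTTP, a Scrapy spider can fetch JS-rendered pages by pointing ordinary requests at its `render.html` endpoint. A minimal sketch; `localhost:8050` assumes the default Splash Docker setup, and the helper name is an assumption:

```python
# Build a URL for Splash's render.html endpoint, which returns the page's
# HTML after JavaScript has executed. Assumes a Splash instance running
# locally on its default port (e.g. via `docker run -p 8050:8050 scrapinghub/splash`).
from urllib.parse import urlencode

SPLASH_RENDER = "http://localhost:8050/render.html"  # assumed local instance

def splash_url(target, wait=0.5):
    """Wrap a target URL so Splash renders it; `wait` gives JS time to run."""
    return SPLASH_RENDER + "?" + urlencode({"url": target, "wait": wait})

# Hypothetical usage in a spider:
#   yield scrapy.Request(splash_url("http://example.com/"), callback=self.parse)
```

For deeper integration (cookie handling, Lua scripts), the scrapy-splash plugin provides a `SplashRequest` class that replaces this manual URL wrapping.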
