
I'm familiar with scraping websites with Scrapy, but I can't seem to scrape this one (JavaScript, perhaps?).

I'm trying to download historical data for commodities for some personal research from this website: http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx

On this website you will have to select the date and then click go. Once the data is loaded, you can click 'View in Excel' to download a CSV file with commodity prices for that day. I'm trying to build a scraper to download these CSV files for a few months. However, this website seems like a hard nut to crack. Any help will be appreciated.

Things I've tried:

1) Looked at the page source to see if data is being loaded but not shown (hidden).
2) Used Firebug to see if there are any AJAX requests.
3) Modified POST headers to see if I can get data for different days. The POST headers seem very complicated.

1 Answer


ASP.NET websites are notoriously hard to crawl because they rely on viewstates, are extremely strict about request formatting, and involve loads of other nonsense.

Luckily your case seems to be pretty straightforward. Your scrapy approach should look something like:

import scrapy
from scrapy import FormRequest
from scrapy.shell import inspect_response

class MxindiaSpider(scrapy.Spider):
    name = "mxindia"
    allowed_domains = ["mcxindia.com"]
    start_urls = ('http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx',)

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formdata={
                                            'mTbdate': '02/13/2015',  # your date here
                                            'ScriptManager1': 'MupdPnl|mImgBtnGo',
                                            '__EVENTARGUMENT': '',
                                            '__EVENTTARGET': '',
                                            'mImgBtnGo.x': '12',
                                            'mImgBtnGo.y': '9'
                                        },
                                        callback=self.parse_cal, )

    def parse_cal(self, response):
        inspect_response(response, self)  # everything is there!

What we do here is create a FormRequest from the response object we already have. It's smart enough to find the <input> and <form> fields and generate the formdata for us. However, some input fields have no defaults, or we need to override the defaults, so those need to be overridden via the formdata argument. So we provide the formdata argument with updated form values. When you inspect the request in your browser's network tab, you can see all of the form values you need to make a successful request.

So just copy all of them over to your formdata. ASP.NET is really picky about the formdata, so it takes some experimenting to figure out what is required and what is not.
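To see what the form actually posts, you can dump the name and default value of every <input> on the page. This is a standard-library-only sketch; the sample HTML below is a made-up stand-in for the real page source (the actual field list and values on the MCX page will differ):

```python
from html.parser import HTMLParser

class FormFieldLister(HTMLParser):
    """Collect the name and default value of every <input> tag."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                self.fields[attrs["name"]] = attrs.get("value") or ""

# Stand-in for the real page source; real ASP.NET pages carry these
# hidden fields, but the values here are invented for illustration.
sample = """
<form method="post" action="BhavCopyDateWiseArchive.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDw...=" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wE...=" />
  <input type="text" name="mTbdate" value="" />
</form>
"""

parser = FormFieldLister()
parser.feed(sample)
for name, value in parser.fields.items():
    print(name, "=>", value or "(empty)")
```

FormRequest.from_response does this same collection for you; listing the fields yourself is just a quick way to see which ones you may need to override.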

I'll leave you to figure out how to get to the next page yourself; usually it just adds an additional key to formdata, like 'page': '2'.
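On ASP.NET grids specifically, pagination is often driven by a postback through __EVENTTARGET rather than a plain 'page' key. A hedged sketch of how the follow-up formdata might be built, reusing the values from the spider above (the 'Page$2' argument is an assumption about how this particular grid names its postbacks):

```python
# Base form values from the first request (field names taken from the
# spider above).
base_formdata = {
    'mTbdate': '02/13/2015',
    'ScriptManager1': 'MupdPnl|mImgBtnGo',
    '__EVENTARGUMENT': '',
    '__EVENTTARGET': '',
    'mImgBtnGo.x': '12',
    'mImgBtnGo.y': '9',
}

# ASP.NET postback pagination typically fires __EVENTTARGET with a
# 'Page$N' style argument; 'Page$2' is an assumption for this site.
page_formdata = dict(base_formdata, __EVENTTARGET='Page$2')

# The image-button coordinates belong to the Go click, not the
# pagination postback, so drop them for the page-2 request.
page_formdata.pop('mImgBtnGo.x')
page_formdata.pop('mImgBtnGo.y')

print(page_formdata['__EVENTTARGET'])  # -> Page$2
```

In the spider you would pass page_formdata to another FormRequest.from_response call inside parse_cal, so the hidden __VIEWSTATE fields from the current response get merged in automatically.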


1 Comment

Thanks @Granitosaurus! I was able to get the first page data using your answer. I'm playing around with the formdata to get content for page 2, etc. It looks like `'__EVENTTARGET': 'Page$2'` is how to change pages; I will have more tinkering to do. I'm hopeful to get the CSV instead of crawling individual pages; maybe `'__EVENTTARGET': 'btnLink_Excel'` is the way forward.
