
I'm familiar with scraping websites with Scrapy, but I can't seem to scrape this one (JavaScript, perhaps?).

I'm trying to download historical data for commodities for some personal research from this website: http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx

On this website you will have to select the date and then click go. Once the data is loaded, you can click 'View in Excel' to download a CSV file with commodity prices for that day. I'm trying to build a scraper to download these CSV files for a few months. However, this website seems like a hard nut to crack. Any help will be appreciated.

Things I've tried:

1) Looked at the page source to see if data is being loaded but not shown (hidden).
2) Used Firebug to see if there are any AJAX requests.
3) Modified POST headers to see if I can get data for different days. The POST headers seem very complicated.

1 Answer


ASP.NET websites are notoriously hard to crawl because they rely on viewstates, are extremely strict about request formatting, and involve loads of other nonsense.

Luckily your case seems to be pretty straightforward. Your scrapy approach should look something like:

import scrapy
from scrapy import FormRequest
from scrapy.shell import inspect_response

class MxindiaSpider(scrapy.Spider):
    name = "mxindia"
    allowed_domains = ["mcxindia.com"]
    start_urls = ('http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx',)

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formdata={
                                            'mTbdate': '02/13/2015',  # your date here
                                            'ScriptManager1': 'MupdPnl|mImgBtnGo',
                                            '__EVENTARGUMENT': '',
                                            '__EVENTTARGET': '',
                                            'mImgBtnGo.x': '12',
                                            'mImgBtnGo.y': '9'
                                        },
                                        callback=self.parse_cal, )

    def parse_cal(self, response):
        inspect_response(response, self)  # everything is there!

What we do here is create a FormRequest from the response object we already have. It's smart enough to find the <input> and <form> fields and generate the formdata for us. However, some input fields have no defaults, or we need to override the defaults, so those need to be overridden via the formdata argument. So we provide the formdata argument with updated form values. When you inspect the request in your browser's network tab, you can see all of the form values you need to make a successful request.

So just copy all of them over to your formdata. ASP.NET is really picky about the formdata, so it takes some experimenting to figure out what is required and what is not.
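To see what the form actually posts, you can dump the name and default value of every <input> on the page. This is a standard-library-only sketch; the sample HTML below is a made-up stand-in for the real page source (the actual field list and values on the MCX page will differ):

```python
from html.parser import HTMLParser

class FormFieldLister(HTMLParser):
    """Collect the name and default value of every <input> tag."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                self.fields[attrs["name"]] = attrs.get("value") or ""

# Stand-in for the real page source; real ASP.NET pages carry these
# hidden fields, but the values here are invented for illustration.
sample = """
<form method="post" action="BhavCopyDateWiseArchive.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDw...=" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wE...=" />
  <input type="text" name="mTbdate" value="" />
</form>
"""

parser = FormFieldLister()
parser.feed(sample)
for name, value in parser.fields.items():
    print(name, "=>", value or "(empty)")
```

FormRequest.from_response does this same collection for you; listing the fields yourself is just a quick way to see which ones you may need to override.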

I'll leave you to figure out how to get to the next page yourself; usually it just adds an additional key to formdata, like 'page': '2'.
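On ASP.NET grids specifically, pagination is often driven by a postback through __EVENTTARGET rather than a plain 'page' key. A hedged sketch of how the follow-up formdata might be built, reusing the values from the spider above (the 'Page$2' argument is an assumption about how this particular grid names its postbacks):

```python
# Base form values from the first request (field names taken from the
# spider above).
base_formdata = {
    'mTbdate': '02/13/2015',
    'ScriptManager1': 'MupdPnl|mImgBtnGo',
    '__EVENTARGUMENT': '',
    '__EVENTTARGET': '',
    'mImgBtnGo.x': '12',
    'mImgBtnGo.y': '9',
}

# ASP.NET postback pagination typically fires __EVENTTARGET with a
# 'Page$N' style argument; 'Page$2' is an assumption for this site.
page_formdata = dict(base_formdata, __EVENTTARGET='Page$2')

# The image-button coordinates belong to the Go click, not the
# pagination postback, so drop them for the page-2 request.
page_formdata.pop('mImgBtnGo.x')
page_formdata.pop('mImgBtnGo.y')

print(page_formdata['__EVENTTARGET'])  # -> Page$2
```

In the spider you would pass page_formdata to another FormRequest.from_response call inside parse_cal, so the hidden __VIEWSTATE fields from the current response get merged in automatically.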


1 Comment

Thanks @Granitosaurus! I was able to get the first page data using your answer. I'm playing around with the formdata to get content for page 2, etc. It looks like `'__EVENTTARGET': 'Page$2'` is how to change pages; I will have more tinkering to do. I'm hopeful to get the CSV instead of crawling individual pages; maybe `'__EVENTTARGET': 'btnLink_Excel'` is the way forward.
