I'm trying to scrape https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx, which on paper seems like an easy task, with plenty of resources in other SO questions. Nonetheless, I get the same error no matter how I change my request.
I've tried the following:
import requests
from bs4 import BeautifulSoup
url = "https://apps.neb-one.gc.ca/CommodityStatistics/Statistics.aspx"
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = s.get(url)
    soup = BeautifulSoup(response.content)
    data = {
        "ctl00$MainContent$rdoCommoditySystem": "ELEC",
        "ctl00$MainContent$lbReportName": "171",
        "ctl00$MainContent$ddlFrom": "01/11/2018 12:00:00 AM",
        "ctl00$MainContent$rdoReportFormat": "Excel",
        "ctl00$MainContent$btnView": "View",
        "__EVENTVALIDATION": soup.find('input', {'name': '__EVENTVALIDATION'}).get('value', ''),
        "__VIEWSTATE": soup.find('input', {'name': '__VIEWSTATE'}).get('value', ''),
        "__VIEWSTATEGENERATOR": soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value', '')
    }
    response = requests.post(url, data=data)
When I print response.content, I get this message (tl;dr: it says "System error occurred. The system will alert technical support of the problem"):
b'\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml" >\r\n<head><title>\r\n\r\n</title></head>\r\n<body>\r\n <form name="form1" method="post" action="Error.aspx?ErrorID=86e0c980-7832-4fc5-b5a8-a8254dd8ad69" id="form1">\r\n<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMTg3NjI4NzkzNmRkaCA5IA9393/t2iMAptLYU1QiPc8=" />\r\n\r\n<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9D6BDE45" />\r\n <div>\r\n <h4>\r\n <span id="lblError">Error</span>\r\n </h4>\r\n <span id="lblMessage" class="Validator"><font color="Black">System error occurred. The system will alert technical support of the problem.</font></span>\r\n </div>\r\n </form>\r\n</body>\r\n</html>\r\n'
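To tell this error page apart from a real report in code, something like the following could work. It is only a sketch: `parse_error_page` is a hypothetical helper name, and the regular expressions assume the error page always uses the markup shown above (a form posting to Error.aspx?ErrorID=... and a span with id lblMessage).

```python
import re
from html import unescape

# Assumption: the error form always has action="Error.aspx?ErrorID=<guid>"
ERROR_ACTION_RE = re.compile(r'action="Error\.aspx\?ErrorID=([0-9a-fA-F-]+)"')
# Assumption: the message always sits in <span id="lblMessage" ...>...</span>
ERROR_MESSAGE_RE = re.compile(r'<span id="lblMessage"[^>]*>(.*?)</span>', re.DOTALL)
# Used to strip inner markup such as the <font> tag from the message
TAG_RE = re.compile(r"<[^>]+>")


def parse_error_page(body):
    """Return (error_id, message) if body looks like the error page, else None."""
    text = body.decode("utf-8", errors="replace") if isinstance(body, bytes) else body
    action = ERROR_ACTION_RE.search(text)
    if action is None:
        return None
    message = ERROR_MESSAGE_RE.search(text)
    cleaned = unescape(TAG_RE.sub("", message.group(1))).strip() if message else ""
    return action.group(1), cleaned
```

Calling `parse_error_page(response.content)` after the POST would then give either `None` or the ErrorID together with the message text, which is easier to check than the raw bytes.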
I have tried other options, such as changing the __EVENTTARGET argument, as suggested here, and passing the cookie from the first request to the POST request. Checking the page source, I noticed that the form has a "query" function that needs __EVENTTARGET and __EVENTARGUMENT to work:
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
But both arguments are empty in the body of the POST request (as can be checked in the Chrome developer inspector). Another problem is that I need to either download the file in one of the formats (PDF or Excel) or get the HTML version, but the .aspx form does not render the information on the same page; it opens a new URL instead, https://apps.neb-one.gc.ca/CommodityStatistics/ViewReport.aspx, with the information.
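To make the two steps concrete, here is a rough sketch of the flow I think the browser follows: submit the form with every hidden field (including the empty __EVENTTARGET and __EVENTARGUMENT), then request ViewReport.aspx with the same session. The names `InputCollector`, `build_postback_payload` and `fetch_report` are made up for illustration, the sketch uses the standard library's HTMLParser instead of BeautifulSoup only to stay dependency-free, and it assumes the server ties the report to the session cookie. I have not confirmed that the server accepts this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class InputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on the page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_payload(page_html, form_values, event_target="", event_argument=""):
    """Merge the page's hidden fields with the visible form values."""
    collector = InputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    # __doPostBack fills these two before submitting; a plain button click leaves them empty
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    payload.update(form_values)
    return payload


def fetch_report(session, form_url, form_values, report_page="ViewReport.aspx"):
    """GET the form, POST it back, then GET the report page, all on one session.

    Assumption: `session` behaves like requests.Session (get/post, responses with .text).
    """
    page = session.get(form_url)
    payload = build_postback_payload(page.text, form_values)
    session.post(form_url, data=payload)
    return session.get(urljoin(form_url, report_page))
```

With requests, this would be called as `fetch_report(requests.Session(), url, data)`, where `data` holds only the ctl00$MainContent$... entries from my attempt above.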
I am kind of lost here. What am I missing?