2

I want to download an xls file by clicking the button "Export to excel" from the following url: https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD.

More specifically, the button with name = "ctl00$MainContent$btndata". I've already been able to do this using Selenium, but I plan on building a Docker image with this script and running it as a container, because this xls is regularly updated and I need the most current data on my local machine. It doesn't make sense to have a browser open that often just to fetch this data. I understand there are headless versions of Chrome and Firefox, although I don't believe they support downloads. I also understand that wget will not work in this situation because the button is not a static link to the resource. Maybe there's a completely different approach for downloading and updating this data on my computer?

import urllib
import requests
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
    'Origin': 'https://www.tampagov.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
    'Accept-Encoding': 'gzip,deflate,br',
    'Accept-Language': 'en-US,en;q=0.5',
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f, "html.parser")
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Host', 'apps,tampagov.net'),
    ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'))



payload = urllib.urlencode(formData)
# second HTTP request with form data
r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", params=payload)
print(r.status_code, r.reason)
  • You can look at the source code of that web page and find out what the 'Export to Excel' button does; normally it triggers an AJAX request to some URL. Then, in your script, simulate the same request to that URL to get the Excel data. You don't need to care about the HTML content. Commented Mar 27, 2018 at 18:51
  • @Sphinx If I inspect the button, then under Elements > Event Listeners > Click it looks like there are some links to AJAX requests? I really have no idea what I'm looking at, though. I'm really bad with HTML & JS. I'm trying to learn though lol Commented Mar 27, 2018 at 18:58
  • Actually, I found out how that button works, but I don't think it is a good idea to post it. The owner of that website may kick my ass... But a hint: open the web console, switch to the 'Network' tab, then click the 'Export to Excel' button. You should see one HTTP POST request in the Network tab. Finally, in your script, simulate the same HTTP POST and you will get the data you need. Commented Mar 27, 2018 at 19:15
  • Another thing: pay attention to __VIEWSTATE and __EVENTVALIDATION. I recommend you google these two keywords to find out what they are. Commented Mar 27, 2018 at 19:20
  • I was able to find __VIEWSTATE and __EVENTVALIDATION in the Network tab within the POST request and then read up on them for a few minutes. So when simulating the HTTP POST, this data obviously needs to be sent to the server so it thinks the button was clicked, but how do I include that in my script? Commented Mar 28, 2018 at 15:16

2 Answers

1

First: I removed import urllib because requests alone is enough.

Some issues you have:

  1. You don't need to build a nested tuple and then apply urllib.urlencode; use a dictionary instead, and pass it as data= (the request body) rather than params= (the query string). This is one reason why requests is so popular.

  2. You'd better populate all of the parameters for the HTTP POST request, as I did below; otherwise the backend may reject the request.

  3. I added some simple code to save the content to a local file.

PS: you can get the values of those form parameters by analyzing the HTML returned by the HTTP GET. You can also customize the parameters as you need, such as the page size.

Below is a working sample:

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def downloadExcel():
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Origin': 'https://www.tampagov.net',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    r = requests.get("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", headers=headers)
    # parse and retrieve two vital form values
    if r.status_code != 200:
        print('Error')
        return
    soup = BeautifulSoup(r.content, "html.parser")
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
    print('__VIEWSTATE:', viewstate)
    print('__EVENTVALIDATION:', eventvalidation)
    formData = {
        '__EVENTVALIDATION': eventvalidation,
        '__VIEWSTATE': viewstate,
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATEGENERATOR': '49DF2C80',
        'MainContent_RadScriptManager1_TSM':""";;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:59e0a739-153b-40bd-883f-4e212fc43305:ea597d4b:b25378d2;Telerik.Web.UI, Version=2015.2.826.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:c2ba43dc-851e-4009-beab-3032480b6a4b:16e4e7cd:f7645509:24ee1bba:c128760b:874f8ea2:19620875:4877f69a:f46195d3:92fe8ea0:fa31b949:490a9d4e:bd8f85e4:58366029:ed16cbdc:2003d0b8:88144a7a:1e771326:aa288e2d:b092aa46:7c926187:8674cba1:ef347303:2e42e72a:b7778d6c:c08e9f8a:e330518b:c8618e41:e4f8f289:1a73651d:16d8629e:59462f1:a51ee93e""",
        'search_block_form':'',
        'ctl00$MainContent$btndata':'Export to Excel',
        'ctl00_MainContent_RadWindow1_C_RadGridVehicles_ClientState':'',
        'ctl00_MainContent_RadWindow1_ClientState':'',
        'ctl00_MainContent_RadWindowManager1_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl00$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time$dateInput':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_dateInput_ClientState':'{"enabled":true,"emptyMessage":"","validationText":"","valueAsString":"","minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00","lastSetTextBoxValue":""}',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_ClientState':'{"minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00"}',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1address':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1address_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1case_description':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1case_description_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_grid':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1report_number':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1report_number_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_max_date':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_rowcount':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState':'',
        'ctl00_MainContent_RadGrid1_rfltMenu_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedTimeView_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_SD':'[]',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_AD':'[[1900,1,1],[2099,12,31],[2018,3,29]]',
        'ctl00_MainContent_RadGrid1_ClientState':'',
        }

    # second HTTP request with form data
    r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", data=formData, headers=headers)
    print('received:', r.status_code, len(r.content))
    with open(r"C:\Users\xxx\Desktop\test\test\apps.xls", "wb") as handle:  # adjust this path for your OS
        for data in tqdm(r.iter_content(chunk_size=8192)):  # 8 KiB chunks; the default is 1 byte at a time
            handle.write(data)

downloadExcel()
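The PS above, about reading the form parameters from the HTML returned by the GET, can be sketched roughly as follows. This is only an illustration, not part of the working sample: FormFieldCollector, extract_form_fields and build_export_payload are hypothetical helper names, and the sketch uses the standard library's html.parser instead of BeautifulSoup so that it has no dependencies. It assumes every field the server needs is present as a named input element in the page (select elements and unchecked checkboxes are not handled), which may not hold for every ASP.NET page.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        input_type = (attributes.get("type") or "text").lower()
        # Skip unnamed inputs and all buttons: a browser only submits
        # the one button that was actually clicked.
        if name and input_type not in ("submit", "button", "image"):
            self.fields[name] = attributes.get("value") or ""


def extract_form_fields(html):
    """Return a dict of the input fields found in the page."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


def build_export_payload(html,
                         button_name="ctl00$MainContent$btndata",
                         button_value="Export to Excel"):
    """Copy the page's fields and add the button being 'clicked'."""
    payload = extract_form_fields(html)
    payload[button_name] = button_value
    return payload
```

With something like this, the dictionary passed as data= could be built from r.text of the GET response, so values such as __VIEWSTATE and __EVENTVALIDATION are always read from the current page rather than copied by hand.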

1 Comment

Since I'm on a Linux machine, I was getting some errors on line 68, specifically when trying to open the specified file with the mode set to write+binary, but that was the easiest fix of this whole project. +rep to you, my friend, you've helped me so much @sphinx
0

Find out the URL you need to fetch as @Sphinx explains, and then simulate it using something similar to:

import urllib.request
import urllib.parse

data = urllib.parse.urlencode({...})
data = data.encode('ascii')

with urllib.request.urlopen("http://...", data) as fd:
    print(fd.read().decode('utf-8'))

Take a look at the documentation of urllib.
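
For a file download, a slightly fuller version of the same idea might look like the sketch below. It is only an assumption about how one could arrange it: build_post_request and download_to_file are hypothetical names, and the URL and form fields still have to be found as described above. The response is written as raw bytes instead of being decoded as UTF-8, since a spreadsheet may not be valid text.

```python
import urllib.parse
import urllib.request


def build_post_request(url, form_fields):
    """Return a urllib Request that POSTs form_fields as a URL-encoded body."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def download_to_file(request, path):
    """Send the request and write the raw response bytes to path."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as handle:
        handle.write(response.read())
```

The request is built separately from the download so it can be inspected (for example req.data) before anything is sent.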
