I wrote a script that queries an API on a schedule (Tuesday to Saturday) and downloads everything for the previous day.
import requests
import pandas as pd
from datetime import date, timedelta

# This is what I'd normally use, but since there would be no data today,
# I assign a specific date myself
# DATE = (date.today() - timedelta(days=1)).strftime("%Y-%m-%d")
DATE = "2020-10-23"
URL = "https://spending.gov.ua/portal-api/v2/api/transactions/page/"


def fetch(session, params):
    next_page, last_page = 0, 0
    while next_page <= last_page:
        params["page"] = next_page
        data = session.get(URL, params=params).json()
        yield pd.json_normalize(data.get("transactions"))\
            .assign(page=params.get("page"))
        next_page, last_page = next_page+1, data["count"] // data["pageSize"]


def fetch_all():
    with requests.Session() as session:
        params = {"page": 0, "pageSize": 100, "startdate": DATE, "enddate": DATE}
        yield from fetch(session, params)


if __name__ == "__main__":
    data = fetch_all()
    pd.concat(data).to_csv(f"data/{DATE}.csv", index=False)
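A side note on the paging arithmetic in fetch: if count is the total number of transactions and pages are numbered from 0 (both are my assumptions about the response, not something I have confirmed), then count // pageSize points one page past the end whenever count is an exact multiple of pageSize, so the loop would request one extra, empty page. A minimal sketch of an alternative, where last_page_index is a hypothetical helper:

```python
def last_page_index(count, page_size):
    """Index of the last non-empty page when pages are numbered from 0.

    Assumes `count` is the total number of records, which is a guess
    about the API response rather than documented behaviour.
    """
    # For count > 0 this equals ceil(count / page_size) - 1.
    # For count == 0 it is -1, so `while next_page <= last_page` stops.
    return (count - 1) // page_size


# Compare the original expression with the adjusted one.
for count in (0, 99, 100, 101, 250):
    print(count, count // 100, last_page_index(count, 100))
# 0 0 -1
# 99 0 0
# 100 1 0
# 101 1 1
# 250 2 2
```

The two expressions only disagree when count is 0 or an exact multiple of the page size, so the extra request is easy to miss in testing.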
I’m wondering about a couple of things.
First, whether I’m using requests.Session correctly.
I read in the documentation that:
The Session object allows you to persist certain parameters across requests. ... So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase.
I'm not sure whether that's the case here, as I didn't notice any change in performance.
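One way to check this would be to time the same number of requests with and without a session. A rough sketch follows; time_calls is a hypothetical helper, not part of requests. Any gain depends on how much of each request is connection setup (TCP and TLS handshakes) compared with the time the server takes to respond, so with slow responses the difference may be hard to see.

```python
import time


def time_calls(get, url, n=20):
    """Return the total seconds spent calling get(url) n times."""
    start = time.perf_counter()
    for _ in range(n):
        get(url)
    return time.perf_counter() - start


# Intended usage (needs network access, so left commented out here):
# import requests
# plain = time_calls(requests.get, URL)
# with requests.Session() as session:
#     pooled = time_calls(session.get, URL)
# print(f"requests.get: {plain:.2f}s, session.get: {pooled:.2f}s")
```

Because the helper accepts any callable, the same function can time both requests.get and session.get without duplicating the loop.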
Second, whether splitting the code into two functions instead of one was a good idea.
I thought it would be easier to maintain: the underlying function fetch doesn't change, while fetch_all potentially could. For example, I could pass a range of dates instead of a single date, changing fetch_all to:
def fetch_all(date_range):
    with requests.Session() as session:
        for date in date_range:
            params = {"page": 0, "pageSize": 100, "startdate": date, "enddate": date}
            yield from fetch(session, params)
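The date_range argument could be any iterable of date strings in the same format as DATE. A minimal sketch of one way to build it; iso_dates is a hypothetical helper and the dates are arbitrary:

```python
from datetime import date, timedelta


def iso_dates(start, end):
    """Yield every day from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)


print(list(iso_dates(date(2020, 10, 21), date(2020, 10, 23))))
# ['2020-10-21', '2020-10-22', '2020-10-23']
```

This could then be passed as fetch_all(iso_dates(start, end)). Inside fetch_all, the loop variable date shadows datetime.date within that function; that is harmless here, but a name like day would avoid the clash.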
Finally, the yield and yield from: I could have used .append and returned a list instead. I'm not sure which approach is better.
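For comparison, here are the two styles side by side, using stand-in data instead of real requests (pages_as_generator and pages_as_list are hypothetical names). As far as I can tell, pd.concat collects all the frames before concatenating, so for this script peak memory should be about the same either way; the generator mainly matters if a caller wants to process pages one at a time or stop early.

```python
def pages_as_generator(n_pages):
    """Yield one fake page at a time; nothing runs until the caller iterates."""
    for page in range(n_pages):
        yield {"page": page}


def pages_as_list(n_pages):
    """Build every fake page up front and return them all together."""
    result = []
    for page in range(n_pages):
        result.append({"page": page})
    return result


# The generator lets a caller stop early without producing the remaining pages.
first = next(pages_as_generator(1000))

# The list version has already built all 1000 pages before returning.
everything = pages_as_list(1000)

print(first, len(everything))
# {'page': 0} 1000
```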