Fetching API with requests.Session

Question

I wrote a script that parses API on schedule (Tuesday-Saturday), downloading everything for the previous day.

import requests
import pandas as pd
from datetime import date, timedelta

# # This is what I'd normally use, but since there would be no data today, 
# # I assign specific date myself 
# DATE = (date.today() - timedelta(days=1)).strftime("%Y-%m-%d")
DATE = "2020-10-23"
URL = "https://spending.gov.ua/portal-api/v2/api/transactions/page/" 


def fetch(session, params):
    next_page, last_page = 0, 0
    while next_page <= last_page:
        params["page"] = next_page
        data = session.get(URL, params=params).json()
        yield pd.json_normalize(data.get("transactions"))\
                .assign(page=params.get("page"))
        next_page, last_page = next_page+1, data["count"] // data["pageSize"]
                
        
def fetch_all():
    with requests.Session() as session:
        params = {"page": 0, "pageSize": 100, "startdate": DATE, "enddate": DATE}
        yield from fetch(session, params)
        
        
if __name__ == "__main__":
    data = fetch_all()
    pd.concat(data).to_csv(f"data/{DATE}.csv", index=False)

Here I’m wondering about a couple of things.

Firstly, if I’m using requests.Session correctly.

I read in the documentation that:

The Session object allows you to persist certain parameters across requests. ... So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase.

I'm not sure whether that's the case here as I didn't notice any changes in the performance.

Secondly, if splitting code into two functions instead of one was a good idea.

Here I thought that it would be easier to maintain -- the underlying function fetch doesn't change while fetch_all potentially could. For example, I could feed a range of dates instead of a singe date, changing fetch_all to:

def fetch_all(date_range):
    with requests.Session() as session:
        for date in date_range:
            params = {"page": 0, "pageSize": 100, "startdate": date, "enddate": date}
            yield from fetch(session, params)

Also, the yield and yield from -- could've used .append and returned a list instead. Not sure which approach is better.

Setris · Accepted Answer · 2020-10-27 01:45:55Z

Here I’m wondering about a couple of things.

Firstly, if I’m using requests.Session correctly.

Yes, you are. In one of my other reviews, using requests.Session in the same way for iterating over a paginated API almost halved the total execution time.

I did some quick testing by downloading the last 7 pages (pages 1625-1631) for "2020-10-23" and it did marginally better than making requests with requests.get:

requests.get: 23.2 seconds
requests.Session: 17.7 seconds

Secondly, if splitting code into two functions instead of one was a good idea.

I think it's fine to have it split into two functions. That said, I do have some comments about the responsibilities and interface of fetch and how to better take advantage of your usages of yield and yield from down below.

Overall the code looks clean and is easy to read. Here's how I think it can be improved:

I think all the low-level details of how to issue requests to the API should be abstracted away from the caller of fetch. That is, fetch's function signature should look something like this:
```
def fetch(
    session: requests.Session,
    start_date: date,
    end_date: date,
    starting_page: int = 0,
    page_size: int = 100,
) -> Iterator[pd.DataFrame]:
    pass
```
So now creating an appropriate params would be fetch's responsibility, not fetch_all's. Notice also that start_date and end_date are of type datetime.date, not str. Similarly, fetch_all should not have to be concerned with what date string serialization format the API accepts; this is fetch's responsibility.

Within fetch, instead of maintaining variables next_page and last_page on each request, I think it would be better to calculate the total number of pages (n) only once with the first request (page k), then use a for loop for pages k+1..n-1:

def to_dataframe(json_data: Dict[str, Any], page: int) -> pd.DataFrame:
    return pd.json_normalize(json_data["transactions"]).assign(page=page)


def fetch(
    session: requests.Session,
    start_date: date,
    end_date: date,
    starting_page: int = 0,
    page_size: int = 100,
) -> Iterator[pd.DataFrame]:
    params = {
        "startdate": start_date.isoformat(),
        "enddate": end_date.isoformat(),
        "page": starting_page,
        "pageSize": page_size,
    }

    data = session.get(URL, params=params).json()
    page_count = math.ceil(data["count"] / data["pageSize"])
    last_page = page_count - 1
    if starting_page > last_page:
        return
    print(f"{starting_page} / {last_page}")
    yield to_dataframe(data, starting_page)

    for page in range(starting_page + 1, page_count):
        params["page"] = page
        data = session.get(URL, params=params).json()
        print(f"{page} / {last_page}")
        yield to_dataframe(data, page)

The tradeoff here is that there's a small duplication of code because the first request is handled a little differently, but now we've delegated responsibility of page number iteration to the for loop.

I recommend adding an event hook to the session object so that it always calls raise_for_status() on the response object. This ensures that all requests made with the session raise requests.HTTPError if the server gives us a 4xx or 5xx response, and prevents us from converting an error response's .json() data into a dataframe:
```
session.hooks["response"].append(
    lambda r, *args, **kwargs: r.raise_for_status()
)
```
Currently the program is combining all dataframes in memory before exporting it to a CSV file. To take advantage of fetch_all being an Iterator[pd.DataFrame], I think it would be better to write each dataframe to the CSV immediately, so we don't need to hold it in memory any longer than necessary:
```
output_path = Path(f"data/{DATE}.csv")
output_path.unlink(missing_ok=True)
data = fetch_all()
for i, dataframe in enumerate(data):
    write_header = True if i == 0 else False
    dataframe.to_csv(
        output_path, header=write_header, index=False, mode="a"
    )
```

Refactored version:

#!/usr/bin/env python3

import math
from datetime import date, timedelta
from pathlib import Path
from typing import Any, Dict, Iterator

import pandas as pd  # type: ignore
import requests

# # This is what I'd normally use, but since there would be no data today,
# # I assign specific date myself
# DATE = date.today() - timedelta(days=1)
DATE = date.fromisoformat("2020-10-23")
URL = "https://spending.gov.ua/portal-api/v2/api/transactions/page/"


def to_dataframe(json_data: Dict[str, Any], page: int) -> pd.DataFrame:
    return pd.json_normalize(json_data["transactions"]).assign(page=page)


def fetch(
    session: requests.Session,
    start_date: date,
    end_date: date,
    starting_page: int = 0,
    page_size: int = 100,
) -> Iterator[pd.DataFrame]:
    params = {
        "startdate": start_date.isoformat(),
        "enddate": end_date.isoformat(),
        "page": starting_page,
        "pageSize": page_size,
    }

    data = session.get(URL, params=params).json()
    page_count = math.ceil(data["count"] / data["pageSize"])
    last_page = page_count - 1
    if starting_page > last_page:
        return
    print(f"{starting_page} / {last_page}")
    yield to_dataframe(data, starting_page)

    for page in range(starting_page + 1, page_count):
        params["page"] = page
        data = session.get(URL, params=params).json()
        print(f"{page} / {last_page}")
        yield to_dataframe(data, page)


def fetch_all() -> Iterator[pd.DataFrame]:
    with requests.Session() as session:
        session.hooks["response"].append(
            lambda r, *args, **kwargs: r.raise_for_status()
        )
        yield from fetch(session, start_date=DATE, end_date=DATE)


if __name__ == "__main__":
    output_path = Path(f"data/{DATE}.csv")
    output_path.unlink(missing_ok=True)
    data = fetch_all()
    for i, dataframe in enumerate(data):
        write_header = True if i == 0 else False
        dataframe.to_csv(
            output_path, header=write_header, index=False, mode="a"
        )

Though I intentionally used while loop to avoid code duplication in fetch function. But I agree that your approach is more explicit, so I’ll use it instead. Thanks once again. — Hryhorii Pavlenko
– Hryhorii Pavlenko, Commented Oct 27, 2020 at 9:02

Stack Exchange Network

Fetching API with requests.Session

1 Answer 1

You must log in to answer this question.

Linked

Hot Network Questions

Fetching API with requests.Session

1 Answer 1

You must log in to answer this question.

Linked

Related

Hot Network Questions