get value from current row in rolling window

Question

Given the following data structure

import polars as pl

df = pl.DataFrame(
    {
        "order_id": ["o01", "o02", "o03", "o04", "o10", "o11", "o12", "o13"],
        "customer_id": ["ca", "ca", "ca", "ca", "cb", "cb", "cb", "cb"],
        "date": [
            "2024-04-03",
            "2024-04-04",
            "2024-04-04",
            "2024-04-11",
            "2024-04-02",
            "2024-04-02",
            "2024-04-03",
            "2024-05-13",
        ],
    },
    schema_overrides={"date": pl.Date},
)

I would like to do some calculations over a rolling window. For that, I need a value from the current row, taken from a column that is not part of the window definition (i.e. neither the partition nor the frame), e.g. order_id in the following example. I also need a row index per frame (not per partition).

This is what I have so far (the orders column merely illustrates the above-mentioned "calculation"):

(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(
        frame_index=pl.int_range(pl.len()).first(),
        current_order_id=pl.col("order_id").first(),
        orders=pl.col("order_id"),
    )
)

customer_id date        frame_index current_order_id    orders
str         date        i64         str                 list[str]
"ca"        2024-04-03  0           "o01"               ["o01", "o02", "o03"]
"ca"        2024-04-04  0           "o02"               ["o02", "o03"]
"ca"        2024-04-04  0           "o02"               ["o02", "o03"]
"ca"        2024-04-11  0           "o04"               ["o04"]
"cb"        2024-04-02  0           "o10"               ["o10", "o11", "o12"]
"cb"        2024-04-02  0           "o10"               ["o10", "o11", "o12"]
"cb"        2024-04-03  0           "o12"               ["o12"]
"cb"        2024-05-13  0           "o13"               ["o13"]

But I would like to get the following (note the differences in frame_index and current_order_id in the 3rd and 6th rows):

customer_id date        frame_index current_order_id    orders
str         date        i64         str                 list[str]
"ca"        2024-04-03  0           "o01"               ["o01", "o02", "o03"]
"ca"        2024-04-04  0           "o02"               ["o02", "o03"]
"ca"        2024-04-04  1           "o03"               ["o02", "o03"]
"ca"        2024-04-11  0           "o04"               ["o04"]
"cb"        2024-04-02  0           "o10"               ["o10", "o11", "o12"]
"cb"        2024-04-02  1           "o11"               ["o10", "o11", "o12"]
"cb"        2024-04-03  0           "o12"               ["o12"]
"cb"        2024-05-13  0           "o13"               ["o13"]

It seems to me that I am missing a current_row() or nth() expression, but there are probably other clever ways to achieve what I want with polars?

UPDATE: I just noticed that one can add a column from the original dataframe with with_columns(df.select()), see my answer below.

So let's assume that I want to use a value from the current row in the agg step, e.g. to add or subtract it from a group mean or something.

I guess I could just slap the order_id from df onto df2 with pl.concat(..., how="horizontal"), but a) that looks icky and b) let's assume I would like to use that value in an expression on the rolling window.
– dpprdan, Commented Jan 15 at 12:35
This might be slightly related to github.com/pola-rs/polars-xdt/issues/85, IDK.
– dpprdan, Commented Jan 15 at 12:36

dpprdan · Accepted Answer · 2025-01-15 14:39:28Z

A possible workaround is new_df.with_columns(orig_df.select("column_to_add")), if one just needs to add a column from the original dataframe.

(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(
        frame_index=pl.int_range(pl.len()),
        orders=pl.col("order_id"),
    )
    # df must be sorted the same way as the rolling input so the rows line up
    .with_columns(df.sort("customer_id", "date").select(current_order_id="order_id"))
)

UPDATE: @roman is completely right that df has to be sorted before we select a column and slap it onto the new dataframe. Unfortunately, this makes it just as icky as pl.concat, since it is too easy to forget to adjust the sorting in both places.

edited Jan 15 at 14:39

answered Jan 15 at 14:00


2 Comments

roman Jan 15 at 14:04

You would have to sort order_id the same way as in the rolling query, or use a join instead of a simple concatenation.

dpprdan Jan 15 at 14:22

@roman you're right! But that makes it icky again: not DRY, and too easy to overlook and shoot oneself in the foot.

roman · Accepted Answer · 2025-01-15 15:01:23Z

(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(orders=pl.col.order_id)
    .with_columns(
        frame_index = pl.int_range(pl.len()).over("customer_id", pl.col.orders.rle_id()),
        # if all orders are unique then possible without rle
        # frame_index = pl.int_range(pl.len()).over("customer_id", "orders")
        current_order_id = df.sort("customer_id", "date")["order_id"]
    )
)

shape: (8, 5)
┌─────────────┬────────────┬───────────────────────┬─────────────┬──────────────────┐
│ customer_id ┆ date       ┆ orders                ┆ frame_index ┆ current_order_id │
│ ---         ┆ ---        ┆ ---                   ┆ ---         ┆ ---              │
│ str         ┆ date       ┆ list[str]             ┆ i64         ┆ str              │
╞═════════════╪════════════╪═══════════════════════╪═════════════╪══════════════════╡
│ ca          ┆ 2024-04-03 ┆ ["o01", "o02", "o03"] ┆ 0           ┆ o01              │
│ ca          ┆ 2024-04-04 ┆ ["o02", "o03"]        ┆ 0           ┆ o02              │
│ ca          ┆ 2024-04-04 ┆ ["o02", "o03"]        ┆ 1           ┆ o03              │
│ ca          ┆ 2024-04-11 ┆ ["o04"]               ┆ 0           ┆ o04              │
│ cb          ┆ 2024-04-02 ┆ ["o10", "o11", "o12"] ┆ 0           ┆ o10              │
│ cb          ┆ 2024-04-02 ┆ ["o10", "o11", "o12"] ┆ 1           ┆ o11              │
│ cb          ┆ 2024-04-03 ┆ ["o12"]               ┆ 0           ┆ o12              │
│ cb          ┆ 2024-05-13 ┆ ["o13"]               ┆ 0           ┆ o13              │
└─────────────┴────────────┴───────────────────────┴─────────────┴──────────────────┘

Thanks! This indeed recreates the frame_index for my minimal example (which was admittedly poorly chosen and therefore too minimal), but it doesn't create a real row index per frame. E.g. it doesn't work for offset="-1d".
