Given the following data structure:
import polars as pl

df = pl.DataFrame(
    {
        "order_id": ["o01", "o02", "o03", "o04", "o10", "o11", "o12", "o13"],
        "customer_id": ["ca", "ca", "ca", "ca", "cb", "cb", "cb", "cb"],
        "date": [
            "2024-04-03",
            "2024-04-04",
            "2024-04-04",
            "2024-04-11",
            "2024-04-02",
            "2024-04-02",
            "2024-04-03",
            "2024-05-13",
        ],
    },
    schema_overrides={"date": pl.Date},
)
I would like to do some calculations over a rolling window. For that I need two things: a value from the current row, taken from a column that is not part of the window definition (i.e. neither the partition nor the frame), such as order_id in the following example, and a row index per frame (not per partition).
So far I have the following (the orders column is just an illustration of the above-mentioned "calculation"):
(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(
        frame_index=pl.int_range(pl.len()).first(),
        current_order_id=pl.col("order_id").first(),
        orders=pl.col("order_id"),
    )
)
customer_id  date        frame_index  current_order_id  orders
str          date        i64          str               list[str]
"ca"         2024-04-03  0            "o01"             ["o01", "o02", "o03"]
"ca"         2024-04-04  0            "o02"             ["o02", "o03"]
"ca"         2024-04-04  0            "o02"             ["o02", "o03"]
"ca"         2024-04-11  0            "o04"             ["o04"]
"cb"         2024-04-02  0            "o10"             ["o10", "o11", "o12"]
"cb"         2024-04-02  0            "o10"             ["o10", "o11", "o12"]
"cb"         2024-04-03  0            "o12"             ["o12"]
"cb"         2024-05-13  0            "o13"             ["o13"]
But I would like to have the following (note the differences in frame_index and current_order_id in the 3rd and 6th rows):
customer_id  date        frame_index  current_order_id  orders
str          date        i64          str               list[str]
"ca"         2024-04-03  0            "o01"             ["o01", "o02", "o03"]
"ca"         2024-04-04  0            "o02"             ["o02", "o03"]
"ca"         2024-04-04  1            "o03"             ["o02", "o03"]
"ca"         2024-04-11  0            "o04"             ["o04"]
"cb"         2024-04-02  0            "o10"             ["o10", "o11", "o12"]
"cb"         2024-04-02  1            "o11"             ["o10", "o11", "o12"]
"cb"         2024-04-03  0            "o12"             ["o12"]
"cb"         2024-05-13  0            "o13"             ["o13"]
It seems to me that I am missing a current_row() or nth() expression, but there are probably other clever ways to achieve what I want with polars?
UPDATE: I just noticed that one can add a column from the original dataframe with with_columns(df.select()), see my answer below.
So let's assume that I want to use a value from the current row in the agg step, e.g. to add or subtract it from a group mean or something.
I could concatenate order_id from df onto df2 with pl.concat(..., how="horizontal"), but a) that looks icky and b) let's assume I would like to use that value in an expression on the rolling window.