Given the following data structure:

import polars as pl

df = pl.DataFrame(
    {
        "order_id": ["o01", "o02", "o03", "o04", "o10", "o11", "o12", "o13"],
        "customer_id": ["ca", "ca", "ca", "ca", "cb", "cb", "cb", "cb"],
        "date": [
            "2024-04-03",
            "2024-04-04",
            "2024-04-04",
            "2024-04-11",
            "2024-04-02",
            "2024-04-02",
            "2024-04-03",
            "2024-05-13",
        ],
    },
    schema_overrides={"date": pl.Date},
)

I would like to do some calculations over a rolling window. For that I need two things: a value from the current row, taken from a column that is not part of the window definition (i.e. neither the partition nor the frame), e.g. order_id in the following example; and a row index per frame (not per partition).

So far I have the following (the orders column just illustrates the above-mentioned "calculation"):

(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(
        frame_index=pl.int_range(pl.len()).first(),
        current_order_id=pl.col("order_id").first(),
        orders=pl.col("order_id"),
    )
)
customer_id date        frame_index current_order_id    orders
str         date        i64         str                 list[str]
"ca"        2024-04-03  0           "o01"               ["o01", "o02", "o03"]
"ca"        2024-04-04  0           "o02"               ["o02", "o03"]
"ca"        2024-04-04  0           "o02"               ["o02", "o03"]
"ca"        2024-04-11  0           "o04"               ["o04"]
"cb"        2024-04-02  0           "o10"               ["o10", "o11", "o12"]
"cb"        2024-04-02  0           "o10"               ["o10", "o11", "o12"]
"cb"        2024-04-03  0           "o12"               ["o12"]
"cb"        2024-05-13  0           "o13"               ["o13"]

But I would like to have (note the differences in frame_index and current_order_id in the 3rd and 6th row).

customer_id date        frame_index current_order_id    orders
str         date        i64         str                 list[str]
"ca"        2024-04-03  0           "o01"               ["o01", "o02", "o03"]
"ca"        2024-04-04  0           "o02"               ["o02", "o03"]
"ca"        2024-04-04  1           "o03"               ["o02", "o03"]
"ca"        2024-04-11  0           "o04"               ["o04"]
"cb"        2024-04-02  0           "o10"               ["o10", "o11", "o12"]
"cb"        2024-04-02  1           "o11"               ["o10", "o11", "o12"]
"cb"        2024-04-03  0           "o12"               ["o12"]
"cb"        2024-05-13  0           "o13"               ["o13"]

It seems to me that I am missing a current_row() or nth() expression, but there are probably other clever ways to achieve what I want with polars?

UPDATE: I just noticed that one can add a column from the original dataframe with with_columns(df.select()); see my answer below.

So let's assume that I want to use a value from the current row in the agg step, e.g. to add or subtract it from a group mean or something.

  • I guess I could just slap the order_id from df onto df2 with pl.concat(..., how="horizontal"), but a) that looks icky and b) let's assume I would like to use that value in an expression on the rolling window. Commented Jan 15 at 12:35
  • This might be slightly related to github.com/pola-rs/polars-xdt/issues/85, IDK. Commented Jan 15 at 12:36

2 Answers

0

A possible workaround, if one just needs to add a column from the original dataframe, is to call new_df.with_columns(orig_df.select("column_to_add")).

(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(
        frame_index=pl.int_range(pl.len()),
        orders=pl.col("order_id"),
    ).with_columns(df.sort("customer_id", "date").select(current_order_id="order_id"))
)

UPDATE: @roman is completely right that df has to be sorted before we select a column and slap it onto the new dataframe. Unfortunately this makes it about as icky as pl.concat, since the sort now lives in two places and it's too easy to forget to adjust one of them.

2 Comments

You would have to order order_id the same way as the rolling query OR use a join instead of simple concatenation
@roman you're right! But that makes it icky again. Not DRY, and too easy to overlook and shoot oneself in the foot.
0
(
    df.sort("customer_id", "date")
    .rolling(
        index_column="date",
        period="1w",
        offset="0d",
        closed="left",
        group_by="customer_id",
    )
    .agg(orders=pl.col.order_id)
    .with_columns(
        frame_index = pl.int_range(pl.len()).over("customer_id", pl.col.orders.rle_id()),
        # if all orders are unique then possible without rle
        # frame_index = pl.int_range(pl.len()).over("customer_id", "orders")
        current_order_id = df.sort("customer_id", "date")["order_id"]
    )
)
shape: (8, 5)
┌─────────────┬────────────┬───────────────────────┬─────────────┬──────────────────┐
│ customer_id ┆ date       ┆ orders                ┆ frame_index ┆ current_order_id │
│ ---         ┆ ---        ┆ ---                   ┆ ---         ┆ ---              │
│ str         ┆ date       ┆ list[str]             ┆ i64         ┆ str              │
╞═════════════╪════════════╪═══════════════════════╪═════════════╪══════════════════╡
│ ca          ┆ 2024-04-03 ┆ ["o01", "o02", "o03"] ┆ 0           ┆ o01              │
│ ca          ┆ 2024-04-04 ┆ ["o02", "o03"]        ┆ 0           ┆ o02              │
│ ca          ┆ 2024-04-04 ┆ ["o02", "o03"]        ┆ 1           ┆ o03              │
│ ca          ┆ 2024-04-11 ┆ ["o04"]               ┆ 0           ┆ o04              │
│ cb          ┆ 2024-04-02 ┆ ["o10", "o11", "o12"] ┆ 0           ┆ o10              │
│ cb          ┆ 2024-04-02 ┆ ["o10", "o11", "o12"] ┆ 1           ┆ o11              │
│ cb          ┆ 2024-04-03 ┆ ["o12"]               ┆ 0           ┆ o12              │
│ cb          ┆ 2024-05-13 ┆ ["o13"]               ┆ 0           ┆ o13              │
└─────────────┴────────────┴───────────────────────┴─────────────┴──────────────────┘

2 Comments

Thanks! This indeed recreates the frame_index for my (admittedly poorly chosen and therefore too) minimal example, but it doesn't create a real row index per frame. E.g. it doesn't work for offset="-1d".
@dpprdan oh I see - check updated answer.
