Polars streaming: How to compute a nested window aggregation while avoiding in-memory-maps?

Question

I want to calculate the mean of a column 'c' over a group column 'a', but include only one value per group of a second column 'b'.

Constraints:

I want to preserve all original records in the result.
If possible, I want to avoid self-joins.
The solution should be compatible with the streaming engine, in the sense that it avoids in-memory-maps.

For example, I have this dataframe:

import polars as pl

df = pl.LazyFrame({
    "a": ["X", "Y", "Y", "Y", "Y"],
    "b": ["u", "v", "w", "w", "w"],
    "c": [1, 2, 3, 3, 3],
})

┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ X   ┆ u   ┆ 1   │
│ Y   ┆ v   ┆ 2   │
│ Y   ┆ w   ┆ 3   │
│ Y   ┆ w   ┆ 3   │
│ Y   ┆ w   ┆ 3   │
└─────┴─────┴─────┘

I have tried this, which gives an error:

df.with_columns(
    pl.col("c").first().over("b").mean().over("a").alias("mean")
).collect(engine="streaming")
# polars.exceptions.InvalidOperationError: window expression not allowed in aggregation

The best I have found so far is to introduce a helper column "cum_count":

(
    df.with_columns(pl.col("b").cum_count().over("b").alias("cum_count"))
    .with_columns(pl.col("c").filter(pl.col("cum_count") == 1).mean().over("a").alias("mean"))
).collect(engine="streaming")

which gives the expected result

┌─────┬─────┬─────┬───────────┬──────┐
│ a   ┆ b   ┆ c   ┆ cum_count ┆ mean │
│ --- ┆ --- ┆ --- ┆ ---       ┆ ---  │
│ str ┆ str ┆ i64 ┆ u32       ┆ f64  │
╞═════╪═════╪═════╪═══════════╪══════╡
│ X   ┆ u   ┆ 1   ┆ 1         ┆ 1.0  │
│ Y   ┆ v   ┆ 2   ┆ 1         ┆ 2.5  │
│ Y   ┆ w   ┆ 3   ┆ 1         ┆ 2.5  │
│ Y   ┆ w   ┆ 3   ┆ 2         ┆ 2.5  │
│ Y   ┆ w   ┆ 3   ┆ 3         ┆ 2.5  │
└─────┴─────┴─────┴───────────┴──────┘

but the physical plan shown by .show_graph(plan_stage="physical", engine="streaming") still contains an in-memory-map.

Is this problem fundamentally non-streamable, or is there a better solution that avoids in-memory-maps?

etrotta · Accepted Answer · 2025-10-31 13:00:44Z

4

Even without the "only one value per 'b'" constraint, over() is not yet "supported" by the streaming engine without fallbacks as of version 1.35.1.

Use a group-by followed by a join instead:

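# Keep one row per distinct (b, c) pair
first = df.unique(subset=['b', 'c'], keep='first')
# Mean of c per group a, computed on the de-duplicated rows
mean = first.group_by('a').agg(pl.col('c').mean().alias('mean'))
# Attach the per-group mean back to every original row
result = df.join(mean, on='a')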

Even if Polars implements another way of doing this, odds are that it will do so via a group-by followed by a join under the hood, the same way unique() also becomes a group_by. The issue literally lists ".over() to group-by + join" under "Plan translation to streaming".
