I want to calculate the mean of column 'c' over a group column 'a', but include only one value for each distinct value of a second group column 'b'.
Constraints:
- I want to preserve all original records in the result.
- (if possible) avoid self-joins
- The solution should be compatible with the streaming engine in the sense that it avoids in-memory-maps.
For example, I have this dataframe:
import polars as pl
df = pl.LazyFrame({
    "a": ["X", "Y", "Y", "Y", "Y"],
    "b": ["u", "v", "w", "w", "w"],
    "c": [1, 2, 3, 3, 3],
})
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ X ┆ u ┆ 1 │
│ Y ┆ v ┆ 2 │
│ Y ┆ w ┆ 3 │
│ Y ┆ w ┆ 3 │
│ Y ┆ w ┆ 3 │
└─────┴─────┴─────┘
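To pin down the intended result independently of Polars, the semantics can be sketched in plain Python. This is only a reference for what "one value per 'b'" means here, not a streaming solution. The function name `mean_one_per_b` is hypothetical, and taking the first 'c' seen per 'b' is an assumption (in the example, 'c' is constant within each 'b').

```python
from collections import defaultdict

# Same rows as the example dataframe: (a, b, c)
rows = [
    ("X", "u", 1),
    ("Y", "v", 2),
    ("Y", "w", 3),
    ("Y", "w", 3),
    ("Y", "w", 3),
]


def mean_one_per_b(rows):
    """Hypothetical reference: mean of c per a, counting each b only once."""
    # For each group a, keep the first c seen for each distinct b.
    first_c = defaultdict(dict)
    for a, b, c in rows:
        first_c[a].setdefault(b, c)
    # Average the kept values per group a.
    group_mean = {a: sum(v.values()) / len(v) for a, v in first_c.items()}
    # Broadcast the group mean back so every original row is preserved.
    return [(a, b, c, group_mean[a]) for a, b, c in rows]


result = mean_one_per_b(rows)
```

For group "Y" the kept values are 2 (from "v") and 3 (from "w"), so every "Y" row gets 2.5 rather than the plain mean 2.75.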
I have tried this, which gives an error:
df.with_columns(
    pl.col("c").first().over("b").mean().over("a").alias("mean")
).collect(engine="streaming")
# polars.exceptions.InvalidOperationError: window expression not allowed in aggregation
The best I have found so far is introducing a helper column "cum_count"
(
    df.with_columns(pl.col("b").cum_count().over("b").alias("cum_count"))
    .with_columns(
        pl.col("c").filter(pl.col("cum_count") == 1).mean().over("a").alias("mean")
    )
).collect(engine="streaming")
which gives the expected result
┌─────┬─────┬─────┬───────────┬──────┐
│ a ┆ b ┆ c ┆ cum_count ┆ mean │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 ┆ f64 │
╞═════╪═════╪═════╪═══════════╪══════╡
│ X ┆ u ┆ 1 ┆ 1 ┆ 1.0 │
│ Y ┆ v ┆ 2 ┆ 1 ┆ 2.5 │
│ Y ┆ w ┆ 3 ┆ 1 ┆ 2.5 │
│ Y ┆ w ┆ 3 ┆ 2 ┆ 2.5 │
│ Y ┆ w ┆ 3 ┆ 3 ┆ 2.5 │
└─────┴─────┴─────┴───────────┴──────┘
but, according to .show_graph(plan_stage="physical", engine="streaming"), it requires an in-memory-map.
Is this problem fundamentally non-streamable, or is there a better solution that avoids in-memory-maps?