6

I want to calculate the mean over some group column 'a' but include only one value per second group column 'b'.

Constraints:

  • I want to preserve all original records in the result.
  • (if possible) avoid self-joins
  • The solution should be compatible with the streaming engine in the sense that it avoids in-memory-maps.

For example I have this dataframe

import polars as pl

df = pl.LazyFrame({
    "a": ["X", "Y", "Y", "Y", "Y"],
    "b": ["u", "v", "w", "w", "w"],
    "c": [1, 2, 3, 3, 3],
})
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ X   ┆ u   ┆ 1   │
│ Y   ┆ v   ┆ 2   │
│ Y   ┆ w   ┆ 3   │
│ Y   ┆ w   ┆ 3   │
│ Y   ┆ w   ┆ 3   │
└─────┴─────┴─────┘

I have tried this which gives an error

df.with_columns(
    pl.col("c").first().over("b").mean().over("a").alias("mean")
).collect(engine="streaming")
# polars.exceptions.InvalidOperationError: window expression not allowed in aggregation

The best I have found so far is introducing a helper column "cum_count"

(
    df.with_columns(pl.col("b").cum_count().over("b").alias("cum_count"))
    .with_columns(pl.col("c").filter(pl.col("cum_count") == 1).mean().over("a").alias("mean"))
).collect(engine="streaming")

which gives the expected result

┌─────┬─────┬─────┬───────────┬──────┐
│ a   ┆ b   ┆ c   ┆ cum_count ┆ mean │
│ --- ┆ --- ┆ --- ┆ ---       ┆ ---  │
│ str ┆ str ┆ i64 ┆ u32       ┆ f64  │
╞═════╪═════╪═════╪═══════════╪══════╡
│ X   ┆ u   ┆ 1   ┆ 1         ┆ 1.0  │
│ Y   ┆ v   ┆ 2   ┆ 1         ┆ 2.5  │
│ Y   ┆ w   ┆ 3   ┆ 1         ┆ 2.5  │
│ Y   ┆ w   ┆ 3   ┆ 2         ┆ 2.5  │
│ Y   ┆ w   ┆ 3   ┆ 3         ┆ 2.5  │
└─────┴─────┴─────┴───────────┴──────┘

but requires an in-memory-map when inspecting .show_graph(plan_stage="physical", engine="streaming").

I wonder if this problem is fundamentally non-streamable or if there is a better solution that avoids in-memory-maps?

1 Answer 1

4

Even without the 'only one value per second' constraint, over() is not "supported" without fallbacks yet as of version 1.35.1

Using group by and joining instead:

first = df.unique(subset=['b', 'c'], keep='first')
mean = first.group_by('a').agg(pl.col('c').mean().alias('mean'))
result = df.join(mean, on='a')

Even if polars implements another way of doing this, odds are that it'll do it via a group by followed by a join under the hood (the Issue literally says ".over() to group-by + join" under Plan translation to streaming), the same way unique() also becomes a group_by

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.