Underlying Problem
In general, all expressions within a context (with_columns, select, filter, group_by) are evaluated in parallel. In particular, an expression cannot reference a column created by another expression in the same context.
Solution
Still, you can avoid writing a large expression multiple times by saving it to a variable.
import polars as pl

df = pl.DataFrame({
    "a": [1],
    "b": [2],
    "c": [3],
})

# Build the shared expression once and reuse it below.
ab_expr = pl.col("a") + pl.col("b")

df.with_columns(
    ab_expr.alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)
shape: (1, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ ab  ┆ abc │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   ┆ 3   ┆ 6   │
└─────┴─────┴─────┴─────┴─────┘
Note that Polars' query optimizer detects the shared subexpression, so the computation doesn't necessarily happen twice. This can be checked as follows.
ab_expr = pl.col("a") + pl.col("b")

(
    df
    .lazy()
    .with_columns(
        ab_expr.alias("ab"),
        (ab_expr + pl.col("c")).alias("abc"),
    )
    .explain()
)
simple π 5/6 ["a", "b", "c", "ab", "abc"]
  WITH_COLUMNS:
  [col("__POLARS_CSER_0xd4acad4332698399").alias("ab"), [(col("__POLARS_CSER_0xd4acad4332698399")) + (col("c"))].alias("abc")]
    WITH_COLUMNS:
    [[(col("a")) + (col("b"))].alias("__POLARS_CSER_0xd4acad4332698399")]
      DF ["a", "b", "c"]; PROJECT */3 COLUMNS
In particular, Polars is aware of the subexpression __POLARS_CSER_0xd4acad4332698399 shared between the two expressions.
Syntactic Sugar (?)
Moreover, the walrus operator (:=) can be used to do the variable assignment within the context.
df.with_columns(
    (ab_expr := pl.col("a") + pl.col("b")).alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)
polars.exceptions.ColumnNotFoundError: AB
with_columns_seq disables parallelism, but it still cannot refer to columns that don't exist. github.com/pola-rs/polars/issues/14935#issuecomment-2053723695