
Here, column "AB" is being created and, at the same time, used as input to create column "ABC". This fails.

df = df.with_columns(
  (pl.col("A")+pl.col("B")).alias("AB"),
  (pl.col("AB")+pl.col("C")).alias("ABC")
) 

The only way to achieve the desired result is a second call to with_columns.

df1 = df.with_columns(
  (pl.col("A")+pl.col("B")).alias("AB")
)
df2 = df1.with_columns(
  (pl.col("AB")+pl.col("C")).alias("ABC")
) 
4 Comments

  • Check the polars.DataFrame.with_columns_seq method. Commented Feb 21 at 15:43
  • @Rodalm For some reason it doesn't find a column previously created -- polars.exceptions.ColumnNotFoundError: AB. Commented Feb 21 at 17:01
  • with_columns_seq disables parallelism, but it still cannot refer to columns that don't exist: github.com/pola-rs/polars/issues/14935#issuecomment-2053723695 Commented Feb 21 at 17:18
  • My bad, I had the wrong idea of what the method does. @jqurious thanks for the clarification. Commented Feb 21 at 18:01

2 Answers


Underlying Problem

In general, all expressions within a context (with_columns, select, filter, group_by) are evaluated in parallel. In particular, an expression cannot refer to columns created by other expressions within the same context.

Solution

Still, you can avoid writing large expressions multiple times by saving the expression to a variable.

import polars as pl

df = pl.DataFrame({
    "a": [1],
    "b": [2],
    "c": [3],
})

ab_expr = pl.col("a") + pl.col("b")
df.with_columns(
    ab_expr.alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)
shape: (1, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ ab  ┆ abc │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   ┆ 3   ┆ 6   │
└─────┴─────┴─────┴─────┴─────┘

Note that Polars' query-plan optimization accounts for the shared sub-plan, so the computation doesn't necessarily happen twice. This can be checked as follows.

ab_expr = pl.col("a") + pl.col("b")
(
    df
    .lazy()
    .with_columns(
        ab_expr.alias("ab"),
        (ab_expr + pl.col("c")).alias("abc"),
    )
    .explain()
)
simple π 5/6 ["a", "b", "c", "ab", "abc"]
   WITH_COLUMNS:
   [col("__POLARS_CSER_0xd4acad4332698399").alias("ab"), [(col("__POLARS_CSER_0xd4acad4332698399")) + (col("c"))].alias("abc")] 
     WITH_COLUMNS:
     [[(col("a")) + (col("b"))].alias("__POLARS_CSER_0xd4acad4332698399")] 
      DF ["a", "b", "c"]; PROJECT */3 COLUMNS

In particular, Polars is aware of the sub-plan __POLARS_CSER_0xd4acad4332698399 shared between the expressions (common subexpression elimination).

Syntactic Sugar (?)

Moreover, the walrus operator (:=) can be used to perform the variable assignment inline within the context.

df.with_columns(
    (ab_expr := pl.col("a") + pl.col("b")).alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)

2 Comments

Good thinking. Still, I think it would be even better if we could use a column created previously within the same context as input to create another column within that context.
@Nip Thanks! FYI. There is a relevant issue / feature request here. TLDR. Currently, there is no plan to change the semantics of a context.

I've had this in mind for a bit. Here's a function that gets you the behavior you're looking for.

import polars as pl
from io import StringIO


def aug_exprs(*exprs, **named_exprs):
    # Serialize each expression (aliases stripped), keyed by its output name.
    no_aliases = {}
    for expr in exprs:
        name = expr.meta.output_name()
        no_aliases[name] = expr.meta.undo_aliases().meta.serialize(format="json")
    for name, expr in named_exprs.items():
        no_aliases[name] = expr.meta.undo_aliases().meta.serialize(format="json")
    # Repeatedly inline references to other expressions in this batch until
    # no expression refers to another one's output name (assumes no cycles).
    changed = True
    while changed:
        changed = False
        for one in no_aliases:
            for name, expr in no_aliases.items():
                if name == one:
                    continue
                str_col = "{" + f'"Column":"{one}"' + "}"
                if str_col in expr:
                    no_aliases[name] = expr.replace(str_col, no_aliases[one])
                    changed = True

    return [
        pl.Expr.deserialize(StringIO(expr), format="json").alias(name)
        for name, expr in no_aliases.items()
    ]

Essentially, this function sits between the expressions you type and the context you want to use them in. It checks each expression to see whether the output name of another expression appears as one of its inputs. When that happens, it replaces the column reference with the full definition of that column.

If you have that function then you can do:

df = pl.DataFrame({"a": [1, 2, 3]})

df.lazy().select(aug_exprs(
    (pl.col("a") * 2).alias("b"),
    (pl.col("b") + 2).alias("c"),
    (pl.col("c") - 3).alias("d"))
).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ b   ┆ c   ┆ d   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 2   ┆ 4   ┆ 1   │
│ 4   ┆ 6   ┆ 3   │
│ 6   ┆ 8   ┆ 5   │
└─────┴─────┴─────┘

You'll want to do this only in lazy mode, so that common subexpression elimination keeps the shared columns from being recalculated. This is not extensively tested by any means; I'd only use it in interactive sessions, where you'd notice immediately if it breaks.
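The inlining loop at the heart of aug_exprs can be sketched in plain Python over strings, independent of Polars (substitute and the {name} reference syntax are toy stand-ins for the serialized expression JSON):

```python
def substitute(defs):
    """Repeatedly inline references like {name} to other definitions
    in the batch until none remain (assumes no cyclic references)."""
    changed = True
    while changed:
        changed = False
        for one in list(defs):
            for name, body in defs.items():
                if name == one:
                    continue
                ref = "{" + one + "}"
                if ref in body:
                    defs[name] = body.replace(ref, "(" + defs[one] + ")")
                    changed = True
    return defs

# Mirrors the b/c/d example above: each definition references the previous one.
inlined = substitute({"b": "{a}*2", "c": "{b}+2", "d": "{c}-3"})
# Only "{a}" (an actual input column) remains as a free reference.
```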

