
Here, column "AB" is being created and, at the same time, used as input to create column "ABC". This fails.

df = df.with_columns(
  (pl.col("A")+pl.col("B")).alias("AB"),
  (pl.col("AB")+pl.col("C")).alias("ABC")
) 

The only way to achieve the desired result is a second call to with_columns.

df1 = df.with_columns(
  (pl.col("A")+pl.col("B")).alias("AB")
)
df2 = df1.with_columns(
  (pl.col("AB")+pl.col("C")).alias("ABC")
) 
4 Comments

  • Check the polars.DataFrame.with_columns_seq method. Commented Feb 21 at 15:43
  • @Rodalm For some reason it doesn't find a column previously created -- polars.exceptions.ColumnNotFoundError: AB. Commented Feb 21 at 17:01
  • with_columns_seq disables parallelism, but it still cannot refer to columns that don't exist: github.com/pola-rs/polars/issues/14935#issuecomment-2053723695 Commented Feb 21 at 17:18
  • My bad, I had the wrong idea of what the method does. @jqurious thanks for the clarification. Commented Feb 21 at 18:01

2 Answers


Underlying Problem

In general, all expressions within a context (with_columns, select, filter, group_by) are evaluated in parallel. In particular, an expression cannot refer to columns created by other expressions within the same context.

Solution

Still, you can avoid writing large expressions multiple times by saving the expression to a variable.

import polars as pl

df = pl.DataFrame({
    "a": [1],
    "b": [2],
    "c": [3],
})

ab_expr = pl.col("a") + pl.col("b")
df.with_columns(
    ab_expr.alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)
shape: (1, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ ab  ┆ abc │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   ┆ 3   ┆ 6   │
└─────┴─────┴─────┴─────┴─────┘

Note that Polars' query-plan optimization accounts for the shared sub-plan, so the computation doesn't necessarily happen twice. This can be checked as follows.

ab_expr = pl.col("a") + pl.col("b")
(
    df
    .lazy()
    .with_columns(
        ab_expr.alias("ab"),
        (ab_expr + pl.col("c")).alias("abc"),
    )
    .explain()
)
simple π 5/6 ["a", "b", "c", "ab", "abc"]
   WITH_COLUMNS:
   [col("__POLARS_CSER_0xd4acad4332698399").alias("ab"), [(col("__POLARS_CSER_0xd4acad4332698399")) + (col("c"))].alias("abc")] 
     WITH_COLUMNS:
     [[(col("a")) + (col("b"))].alias("__POLARS_CSER_0xd4acad4332698399")] 
      DF ["a", "b", "c"]; PROJECT */3 COLUMNS

In particular, Polars is aware of the sub-plan __POLARS_CSER_0xd4acad4332698399 shared between the expressions (common subexpression elimination).

Syntactic Sugar (?)

Moreover, the walrus operator (:=) can be used to perform the variable assignment inline within the context.

df.with_columns(
    (ab_expr := pl.col("a") + pl.col("b")).alias("ab"),
    (ab_expr + pl.col("c")).alias("abc"),
)

2 Comments

Good thinking. Still, I think it would be even better if we could use a column created previously within the same context as input to create another column within that context.
@Nip Thanks! FYI. There is a relevant issue / feature request here. TLDR. Currently, there is no plan to change the semantics of a context.

I've had this in mind for a bit. Here's a function that gets you the behavior you're looking for.

import polars as pl
from io import StringIO


def aug_exprs(*exprs, **named_exprs):
    # Serialize each expression (aliases stripped), keyed by its output name.
    no_aliases = {}
    for expr in exprs:
        name = expr.meta.output_name()
        no_aliases[name] = expr.meta.undo_aliases().meta.serialize(format="json")
    for name, expr in named_exprs.items():
        no_aliases[name] = expr.meta.undo_aliases().meta.serialize(format="json")
    # Repeatedly inline references to other expressions in this batch until
    # no expression refers to another one's output name (assumes no cycles).
    changed = True
    while changed:
        changed = False
        for one in no_aliases:
            for name, expr in no_aliases.items():
                if name == one:
                    continue
                str_col = "{" + f'"Column":"{one}"' + "}"
                if str_col in expr:
                    no_aliases[name] = expr.replace(str_col, no_aliases[one])
                    changed = True

    return [
        pl.Expr.deserialize(StringIO(expr), format="json").alias(name)
        for name, expr in no_aliases.items()
    ]

Essentially, this function sits between the expressions you type and the context you want to use them in. It checks each expression to see whether the output name of another expression appears as one of its inputs. When that happens, it replaces the column reference with the full definition of that column.

If you have that function then you can do:

df = pl.DataFrame({"a": [1, 2, 3]})

df.lazy().select(aug_exprs(
    (pl.col("a") * 2).alias("b"),
    (pl.col("b") + 2).alias("c"),
    (pl.col("c") - 3).alias("d"))
).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ b   ┆ c   ┆ d   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 2   ┆ 4   ┆ 1   │
│ 4   ┆ 6   ┆ 3   │
│ 6   ┆ 8   ┆ 5   │
└─────┴─────┴─────┘

You'll want to do this only in lazy mode, so that common subexpression elimination keeps the shared columns from being recalculated. This is not extensively tested by any means; I'd only use it in interactive sessions, where you'd notice immediately if it breaks.
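The inlining loop at the heart of aug_exprs can be sketched in plain Python over strings, independent of Polars (substitute and the {name} reference syntax are toy stand-ins for the serialized expression JSON):

```python
def substitute(defs):
    """Repeatedly inline references like {name} to other definitions
    in the batch until none remain (assumes no cyclic references)."""
    changed = True
    while changed:
        changed = False
        for one in list(defs):
            for name, body in defs.items():
                if name == one:
                    continue
                ref = "{" + one + "}"
                if ref in body:
                    defs[name] = body.replace(ref, "(" + defs[one] + ")")
                    changed = True
    return defs

# Mirrors the b/c/d example above: each definition references the previous one.
inlined = substitute({"b": "{a}*2", "c": "{b}+2", "d": "{c}-3"})
# Only "{a}" (an actual input column) remains as a free reference.
```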

