In the code below I create a polars DataFrame and a pandas DataFrame with identical data. I want to select a set of rows based on a condition on column A, then update the corresponding rows of column C. I've included how I would do this with the pandas DataFrame, but I'm coming up short on how to get this working with polars. The closest I've gotten is using when-then-otherwise, but I'm unable to use anything other than a single value in then.

import pandas as pd
import polars as pl

df_pd = pd.DataFrame({'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                      'B': [1, 1, 2, 2, 1, 1, 2, 2],
                      'C': [1, 2, 3, 4, 5, 6, 7, 8]})

df_pl = pl.DataFrame({'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                      'B': [1, 1, 2, 2, 1, 1, 2, 2],
                      'C': [1, 2, 3, 4, 5, 6, 7, 8]})

df_pd.loc[df_pd['A'] == 'x', 'C'] = [-1, -2, -3, -4]

df_pl ???

Expected output:

┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘
  • Show what you expect the result of this operation to be (type it manually if you can't code it). Commented Dec 16, 2024 at 23:51

4 Answers

5

Actually, in-place updates similar to pandas are supported in polars. In particular, the following works as expected.

df_pl[[0, 1, 2, 3], "C"] = [-1, -2, -3, -4]

Instead of a list of indices, a DataFrame with a single integer column may also be passed. In particular, we can do the following.

idx = df_pl.with_row_index().filter(pl.col("A") == "x").select("index")
df_pl[idx, "C"] = [-1, -2, -3, -4]
shape: (8, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘

See this answer for an alternative solution using pl.DataFrame.update.

3

If you wrap the values Series in pl.lit, you can index into it with Expr.get.

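# Wrap the replacement values in a literal so they can be used inside expressions.
values = pl.lit(pl.Series([-1, -2, -3, -4]))
# Running count of matching rows, shifted to 0-based; non-matching rows stay null.
idxs = pl.when(pl.col.A == 'x').then(1).cum_sum() - 1

# Take values[idxs] where it is defined, otherwise fall back to the existing C.
df_pl.with_columns(C = pl.coalesce(values.get(idxs), 'C'))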
shape: (8, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘

These are the steps expanded.

The indices are created and passed to .get(), which yields null for every non-matching row; pl.coalesce() then fills those nulls with the values from column C.

df_pl.with_columns(
    idxs = idxs,
    values = values.get(idxs),
    D = pl.coalesce(values.get(idxs), 'C')
)
shape: (8, 6)
┌─────┬─────┬─────┬──────┬────────┬─────┐
│ A   ┆ B   ┆ C   ┆ idxs ┆ values ┆ D   │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---    ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i32  ┆ i64    ┆ i64 │
╞═════╪═════╪═════╪══════╪════════╪═════╡
│ x   ┆ 1   ┆ 1   ┆ 0    ┆ -1     ┆ -1  │
│ x   ┆ 1   ┆ 2   ┆ 1    ┆ -2     ┆ -2  │
│ x   ┆ 2   ┆ 3   ┆ 2    ┆ -3     ┆ -3  │
│ x   ┆ 2   ┆ 4   ┆ 3    ┆ -4     ┆ -4  │
│ y   ┆ 1   ┆ 5   ┆ null ┆ null   ┆ 5   │
│ y   ┆ 1   ┆ 6   ┆ null ┆ null   ┆ 6   │
│ y   ┆ 2   ┆ 7   ┆ null ┆ null   ┆ 7   │
│ y   ┆ 2   ┆ 8   ┆ null ┆ null   ┆ 8   │
└─────┴─────┴─────┴──────┴────────┴─────┘

Another option is to get the row index of each True value, e.g. using pl.arg_where().

You can then add a row index and .replace_strict() in the new values.

df_pl.with_columns(
    pl.int_range(pl.len()).replace_strict(
        pl.arg_where(pl.col.A == "x"),
        [-1, -2, -3, -4],
        default = pl.col.C
    )
    .alias("C")
)
shape: (8, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘

3

In Polars, there is not really a notion of assigning to a slice of a DataFrame.

Edit: the above statement was incorrect. See the answer by @Hericks for how this can be achieved. Do note, though, that doing so is not considered idiomatic in Polars.

Also, in when/then/otherwise, Polars expects the lengths of all inputs to be compatible: they must all be the same length, or be scalars that are then broadcast.

With those things in mind, here are a few options:

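Given you know that there are 4 "x" values in column A, you can split the df, update the column, and concat the results back together. This works regardless of which rows the 4 "x" values are in, but note that the "x" rows end up first in the result, so the original row order is only preserved when they were already at the top.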

pl.concat([
  df_pl.filter(pl.col("A") == "x").with_columns(C=pl.Series([-1, -2, -3, -4])),
  df_pl.filter(pl.col("A") != "x"),
])

If you also know that the "x" rows are the first 4 rows, you can pad the new values with nulls and then use when/then/otherwise or coalesce. This only works when you know they are the first 4 rows.

new_values = [-1, -2, -3, -4]
new_c = pl.Series(new_values).extend_constant(None, df_pl.height - len(new_values))
df_pl.with_columns(C=pl.coalesce(new_c, "C"))

On your example data, both of the above snippets output

shape: (8, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘

Note to anyone else reading this answer: if you only need to assign a scalar (literal value), or you have a new list the same length as the DataFrame, just use a plain when/then/otherwise as outlined in the user guide and the API docs instead of the suggestions above.

2

If you don't know the positions of your x values, you can generate a "row index" on the fly and use it. For example, with pl.DataFrame.update():

new_values = [-1, -2, -3, -4]

(
    df_pl
    .with_columns(index = pl.int_range(pl.len()).over("A").cast(pl.UInt32))
    .update(
        pl.DataFrame({"A": "x", "C": new_values}).with_row_index(),
        on=["A","index"],
        how="left"
    )
    .drop("index")
)
shape: (8, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘

Or something like

(
    df_pl.with_row_index()
    .update(
        pl.DataFrame({
            "C": pl.Series(new_values),
            "index": df_pl.select((pl.col.A == "x").arg_true())
        }),
        on=["index"],
        how="left"
    )
    .drop("index")
)
shape: (8, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ 1   ┆ -1  │
│ x   ┆ 1   ┆ -2  │
│ x   ┆ 2   ┆ -3  │
│ x   ┆ 2   ┆ -4  │
│ y   ┆ 1   ┆ 5   │
│ y   ┆ 1   ┆ 6   │
│ y   ┆ 2   ┆ 7   │
│ y   ┆ 2   ┆ 8   │
└─────┴─────┴─────┘

If you know that the x rows are positioned at the beginning of the DataFrame, you can do:

df_pl.with_columns(
    pl.Series(new_values)
    .append(df_pl["C"].tail(-len(new_values)))
    .alias("C")
)

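And if the x rows might not be at the front of the DataFrame, but you don't care about the original order of the rows, you can sort first:

# Sort once and reuse the result, so the untouched C values are taken
# from the sorted frame rather than from the original row order.
df_sorted = df_pl.sort(
    pl.when(pl.col.A == "x").then(0).otherwise(1),
    maintain_order = True
)
df_sorted.with_columns(
    pl.Series(new_values)
    .append(df_sorted["C"].tail(-len(new_values)))
    .alias("C")
)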
