I want to pass each row of a Polars DataFrame into a custom function.

import polars as pl

def my_complicated_function(row):

    # ...

    return result

df = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": [4, 5, 6],
    "baz": [7, 8, 9]
})

I need to process the values with some custom Python logic and want to store the result in a new column.

shape: (3, 4)
┌─────┬─────┬─────┬──────────┐
│ foo ┆ bar ┆ baz ┆ result   │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 4   ┆ 7   ┆ result 1 │
│ 2   ┆ 5   ┆ 8   ┆ result 2 │
│ 3   ┆ 6   ┆ 9   ┆ result 3 │
└─────┴─────┴─────┴──────────┘

In Pandas, I would use df.apply(..., axis=1) for this.

1 Answer

The first step would be to check if your task can be solved natively using Polars Expressions.

If a custom function is necessary, .map_elements() can be used to apply it row by row.

To pass in values from multiple columns, you can utilize the Struct data type.

For example, pl.struct() combines the selected columns into a single struct column:

>>> df.select(pl.struct(pl.all())) # all columns
shape: (3, 1)
┌───────────┐
│ foo       │
│ ---       │
│ struct[3] │
╞═══════════╡
│ {1,4,7}   │
│ {2,5,8}   │
│ {3,6,9}   │
└───────────┘

Calling .map_elements() on pl.struct(...) passes each row to the custom function as a dict keyed by column name.

import polars as pl


def my_complicated_function(row: dict) -> int:
    """
    A function that cannot utilize polars expressions.
    This should be avoided.
    """

    # a dict with column names as keys
    print(f"[DEBUG]: {row=}")
    
    # do some work
    return row["foo"] + row["bar"] + row["baz"]


df = pl.DataFrame({
    "foo": [1, 2, 3], 
    "bar": [4, 5, 6], 
    "baz": [7, 8, 9]
})

df = df.with_columns(
    pl.struct(pl.all())
      .map_elements(my_complicated_function, return_dtype=pl.Int64)
      .alias("foo + bar + baz")
)
# [DEBUG]: row={'foo': 1, 'bar': 4, 'baz': 7}
# [DEBUG]: row={'foo': 2, 'bar': 5, 'baz': 8}
# [DEBUG]: row={'foo': 3, 'bar': 6, 'baz': 9}
shape: (3, 4)
┌─────┬─────┬─────┬─────────────────┐
│ foo ┆ bar ┆ baz ┆ foo + bar + baz │
│ --- ┆ --- ┆ --- ┆ ---             │
│ i64 ┆ i64 ┆ i64 ┆ i64             │
╞═════╪═════╪═════╪═════════════════╡
│ 1   ┆ 4   ┆ 7   ┆ 12              │
│ 2   ┆ 5   ┆ 8   ┆ 15              │
│ 3   ┆ 6   ┆ 9   ┆ 18              │
└─────┴─────┴─────┴─────────────────┘
5 Comments

Wow. This new approach using struct performs much better! I benchmarked this approach using the struct expression versus the solution using map and a list of expressions. I created a dataframe of 4 columns of 10 million integers each, and a trivial row-wise sum function (lol - not that one should ever use either approach for calculating row-sums). The old approach took 49 seconds. The new approach (using struct) used only 12 seconds! Accordingly, I'm removing my answer. Great job!
Nice! I am also thinking of removing the dataframe apply in favor of this. That should make it clear that there is a single way of doing this.
NotFoundError: ham in the newest version
I think you should open an issue at github. In any case, I can confirm that this snippet runs successfully on latest release on pypi.
Is this nearly 3 years later still the way to go if one wants to apply a UDF to a whole row?
