
I am trying to run a custom function on a lazy dataframe on a row-by-row basis. The function itself does not matter, so I'm using softmax as a stand-in; all that matters is that it is not computable via Polars expressions.

I get about this far:

import polars as pl
import numpy as np

def softmax(t):
    a = np.exp(np.array(t))
    return tuple(a/np.sum(a))

ldf = pl.DataFrame({ 'id': [1,2,3], 'a': [0.2,0.1,0.3], 'b': [0.4,0.1,0.3], 'c': [0.4,0.8,0.4]}).lazy()

cols = ['a','b','c']
redict = { f'column_{i}':c for i,c in enumerate(cols) }

ldf.select(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict)).collect()

However, if I want the resulting lazy df to also contain columns other than cols (such as id), I get stuck, because

ldf.with_columns(pl.col(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict))).collect()

no longer works: pl.col(cols).map_batches applies the function column by column, so the callback never sees a whole row...

This does not seem like it would be an uncommon use case, so I'm wondering if I'm missing something.

  • FWIW, polars is very resistant to row-by-row operations and the APIs are, in my experience, correspondingly limited. Commented Mar 3 at 14:55

2 Answers


Polars is pretty averse to row-by-row operations. Generally, if you need that, I'd suggest unpivoting (formerly, “melting”) and computing over the id column.

ldf.unpivot(index="id").with_columns(
    pl.col("value").map_batches(softmax).over("id")
).collect()
shape: (9, 3)
┌─────┬──────────┬──────────┐
│ id  ┆ variable ┆ value    │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ str      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 1   ┆ a        ┆ 0.290461 │
│ 2   ┆ a        ┆ 0.249143 │
│ 3   ┆ a        ┆ 0.322043 │
│ 1   ┆ b        ┆ 0.35477  │
│ 2   ┆ b        ┆ 0.249143 │
│ 3   ┆ b        ┆ 0.322043 │
│ 1   ┆ c        ┆ 0.35477  │
│ 2   ┆ c        ┆ 0.501713 │
│ 3   ┆ c        ┆ 0.355913 │
└─────┴──────────┴──────────┘

If you need this back in wide format, you can pivot the resulting DataFrame.

ldf.unpivot(index="id").with_columns(
    pl.col("value").map_batches(softmax).over("id")
).collect().pivot("variable", index="id")
shape: (3, 4)
┌─────┬──────────┬──────────┬──────────┐
│ id  ┆ a        ┆ b        ┆ c        │
│ --- ┆ ---      ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╪══════════╡
│ 1   ┆ 0.290461 ┆ 0.35477  ┆ 0.35477  │
│ 2   ┆ 0.249143 ┆ 0.249143 ┆ 0.501713 │
│ 3   ┆ 0.322043 ┆ 0.322043 ┆ 0.355913 │
└─────┴──────────┴──────────┴──────────┘

2 Comments

What you are doing applies softmax over each column, not on a row-by-row basis, so it does not really solve the problem I'm having.
Sorry, switched to being over id instead of variable, which corresponds to row-wise in the original df.

I actually found a relatively nice solution that just takes advantage of batches being materialized in memory.

import numpy as np
import polars as pl

def softmax(ar):
    a = np.exp(ar)
    # keepdims so the row sums broadcast correctly against the 2D input
    return a/np.sum(a, axis=-1, keepdims=True)

def apply_npf_on_pl_df(df, cols, npf):
    # replace the selected columns with the NumPy result, keeping the rest
    res = npf(df.select(cols).to_numpy())
    return df.with_columns(pl.Series(c, res[:, i]) for i, c in enumerate(cols))

ldf = pl.DataFrame({ 'id': [1,2,3], 'a': [0.2,0.1,0.3], 'b': [0.4,0.1,0.3], 'c': [0.4,0.8,0.4]}).lazy()

cols = ['a','b','c']

ldf.map_batches(lambda bdf: apply_npf_on_pl_df(bdf,cols,softmax)).collect()

This is likely not ideal if there are a lot of other columns (map_batches materializes the whole batch), but for my use case (with very few additional columns) this looks pretty efficient.

Comments
