I am trying to run a custom function on a lazy dataframe on a row-by-row basis. The function itself does not matter, so I'm using softmax as a stand-in. All that matters is that it cannot be computed with Polars expressions.
I get about this far:
import polars as pl
import numpy as np
def softmax(t):
    # t is one row, passed in as a tuple of values
    a = np.exp(np.array(t))
    return tuple(a / np.sum(a))
ldf = pl.DataFrame({ 'id': [1,2,3], 'a': [0.2,0.1,0.3], 'b': [0.4,0.1,0.3], 'c': [0.4,0.8,0.4]}).lazy()
cols = ['a','b','c']
redict = { f'column_{i}':c for i,c in enumerate(cols) }
ldf.select(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict)).collect()
However, if I want to get a resulting lazy df that contains columns other than cols (such as id), I get stuck, because
ldf.with_columns(pl.col(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict))).collect()
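no longer works, because pl.col(cols).map_batches is applied column by column: the function receives each column as a separate Series rather than a DataFrame holding all of cols.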
This does not seem like it would be an uncommon use case, so I'm wondering if I'm missing something.