I need to pass a variable number of columns to a user-defined function. The docs suggest first creating a pl.struct and then letting the function extract the fields. Here's the example given on the website:
```python
import polars as pl
from numba import float64, guvectorize, int64


# Add two arrays together:
@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]


df3 = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})
out = df3.select(
    # Create a struct that has two columns in it:
    pl.struct(["values1", "values2"])
    # Pass the struct to a lambda that then passes the individual columns to
    # the add() function:
    .map_batches(
        lambda combined: add(
            combined.struct.field("values1"), combined.struct.field("values2")
        )
    )
    .alias("add_columns")
)
print(out)
```
Now, in my case, I don't know upfront how many columns will enter the pl.struct - think of a selector like pl.struct(cs.float()). Inside my user-defined function I need to operate on a np.array, i.e. the function will have a single input argument that receives the whole array. How can I extract it within the user-defined function?
EDIT: The output of my user-defined function will be an array with exactly the same shape as the input array. This array needs to be appended to the existing DataFrame as new columns (axis 1).
EDIT:
Using pl.concat_arr might be one way to attack my concrete issue. My use case would be along the following lines:
```python
import polars as pl


def multiply_by_two(arr):
    # In reality, there are some complex array operations.
    return arr * 2


df = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})
out = df.select(
    # Create an array consisting of two columns:
    pl.concat_arr(["values1", "values2"])
    .map_batches(lambda arr: multiply_by_two(arr))
    .alias("result")
)
```
The new computed column result holds an array with the same shape as the input array. I need to unnest the array (something like pl.struct.unnest()). The resulting column names should be the original names suffixed with "_result" (values1_result and values2_result).
Also, I would like to make use of @guvectorize to speed things up.
Comments:

- In map_batches you have a Series; if you want a numpy array you can just call s.to_numpy(). But it may make more sense to use pl.concat_arr instead of a struct for your use case? Perhaps you could give a code example that is closer to the actual task.
- arr is a pl.Series, so you can just use arr.to_numpy() if you want a numpy array - right?
- If you mean calling arr.to_numpy() within multiply_by_two, then you're correct. Having said that, I think this is not going to work in conjunction with guvectorize. I am also struggling when performing map_batches over a group_by.