I need to pass a variable number of columns to a user-defined function. The docs suggest first creating a pl.struct and then letting the function extract the individual fields. Here's the example given on the website:

import polars as pl
from numba import guvectorize, int64, float64


# Add two arrays together:
@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]


df3 = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})

out = df3.select(
    # Create a struct that has two columns in it:
    pl.struct(["values1", "values2"])
    # Pass the struct to a lambda that then passes the individual columns to
    # the add() function:
    .map_batches(
        lambda combined: add(
            combined.struct.field("values1"), combined.struct.field("values2")
        )
    )
    .alias("add_columns")
)
print(out)

Now, in my case, I don't know upfront how many columns will enter the pl.struct. Think of using a selector like pl.struct(cs.float()). In my user-defined function, I need to operate on a np.array. That is, the user-defined function will have one input argument that takes the whole array. How can I then extract it within the user-defined function?

EDIT: The output of my user-defined function will be an array that has the exact same shape as the input array. This array needs to be appended to the existing dataframe on axis 1 (new columns).

EDIT: Using pl.concat_arr might be one way to attack my concrete issue. My use case would be along the following lines:

import polars as pl


def multiply_by_two(arr):
    # In reality, there are some complex array operations.
    return arr * 2


df = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})

out = df.select(
    # Create an array consisting of two columns:
    pl.concat_arr(["values1", "values2"])
    .map_batches(lambda arr: multiply_by_two(arr))
    .alias("result")
)

The new computed column result holds an array with the same shape as the input. I need to unnest that array (something like pl.struct.unnest()), with the new column names being the original names suffixed with "_result" (values1_result and values2_result).

Also, I would like to make use of @guvectorize to speed things up.

  • Have you considered using this? Commented Feb 18 at 11:17
  • Inside map_batches you have a Series, if you want a numpy array you can just call s.to_numpy() - but it may make more sense to use pl.concat_arr instead of a struct for your use case? Perhaps you give a code example that is closer to the actual task. Commented Feb 18 at 11:59
  • @jqurious I added another example. Commented Feb 18 at 15:30
  • arr is a pl.Series, you can just use arr.to_numpy() if you want a numpy array - right? Commented Feb 19 at 13:37
  • If you are talking about arr.to_numpy() within multiply_by_two, then you're correct. Having said that, I think this is not going to work in conjunction with guvectorize. I am also struggling when performing map_batches over a group_by. Commented Feb 19 at 14:09

1 Answer

A few things. If you call .to_numpy() on either an Array or a struct, it seems to return the same np.array, so the choice between the two comes down to memory efficiency and features. The elements of an Array aren't named, and you want the output names to correspond to the input columns, so you probably want a struct. I'm not sure what the memory implications are between the two: I know that going from columns to a struct is cheaper than going from columns to an Array, but intuitively columns -> struct -> np.array ought to cost about the same as columns -> Array -> np.array.

Anyway, with that said, here's how to do it:

def multiply_by_two(arr: pl.Series) -> pl.Series:
    # Capture the field names of the input struct.
    names = arr.struct.fields
    arrnp = arr.to_numpy()
    res = arrnp * 2
    # Convert the 2D result back to a struct with suffixed field names.
    return pl.Series(res).arr.to_struct(fields=[f"{name}_result" for name in names])

df.with_columns(
    # Create a struct from the two columns:
    pl.struct(["values1", "values2"])
    .map_batches(lambda arr: multiply_by_two(arr))
    .alias("result")
).unnest("result")

shape: (3, 4)
┌─────────┬─────────┬────────────────┬────────────────┐
│ values1 ┆ values2 ┆ values1_result ┆ values2_result │
│ ---     ┆ ---     ┆ ---            ┆ ---            │
│ i64     ┆ i64     ┆ i64            ┆ i64            │
╞═════════╪═════════╪════════════════╪════════════════╡
│ 1       ┆ 10      ┆ 2              ┆ 20             │
│ 2       ┆ 20      ┆ 4              ┆ 40             │
│ 3       ┆ 30      ┆ 6              ┆ 60             │
└─────────┴─────────┴────────────────┴────────────────┘

You can't unnest from within the .with_columns; you have to do it at the DataFrame level.

As for combining the above with numba, it should work much the same. Search for polars and numba to find other questions/answers where the two are used together. If you can form a more specific question about their interaction, then ask away.
