I've searched around, but many of the answers are old and no longer valid with the current Polars version. How do I apply a Python function to each row of a Polars DataFrame and store its return value in a new column? I want to pass the entire row to the function instead of passing specific columns.

import polars as pl

def test2(auth, row):
    c = row["Group"]
    d = row["Val"]
    return "{}-{}-{}".format(c, str(d), auth)

df = pl.DataFrame({
    'Group': ['A', 'B', 'C', 'D', 'E'],
    'Val': [1001, 1002, 1003, 1004, 1005]
})

auth_token = "xxxxxxxxx"

df = df.with_columns(
    pl.struct(pl.all())
    .map_batches(lambda x: test2(auth_token, x))
    .alias("response")
)

print(df)

The code above raises the error below. I don't understand the message: where am I supposed to set strict=False, and why is it necessary?

Traceback (most recent call last):
  File "c:\Scripting\Python\Development\Test.py", line 29, in <module>
    df = df.with_columns(
  File "c:\Scripting\Python\Development\venv\lib\site-packages\polars\dataframe\frame.py", line 8763, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
  File "c:\Scripting\Python\Development\venv\lib\site-packages\polars\lazyframe\frame.py", line 1942, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: TypeError: unexpected value while building Series of type Int64; found value of type String: "C"

Hint: Try setting `strict=False` to allow passing data with mixed types.

I'm aware that I could do this by naming specific columns, as in the code below, but I want to pass in the whole row and select which columns to use inside the function. Any help would be appreciated. Thank you.

df = df.with_columns(
    (
        pl.struct(["Group", "Val"]).map_batches(
            lambda x: test(auth_token, x.struct.field("Group"), x.struct.field("Val"))
        )
    ).alias("api_response")
)
  • In your example you can just use map_elements() instead of map_batches(). It will be quite slow, though, because that way you're not using any of Polars' optimizations. (Commented Jul 3, 2024 at 13:43)

1 Answer

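Use map_elements() instead of map_batches(). map_batches() passes the whole struct column to the function as a single Series, whereas map_elements() calls the function once per row and passes that row as a dictionary, so row["Group"] and row["Val"] work as expected.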

df = df.with_columns(
    pl.struct(pl.all())
    .map_elements(lambda x: test2(auth_token, x))
    .alias("response")
)

┌───────┬──────┬──────────────────┐
│ Group ┆ Val  ┆ response         │
│ ---   ┆ ---  ┆ ---              │
│ str   ┆ i64  ┆ str              │
╞═══════╪══════╪══════════════════╡
│ A     ┆ 1001 ┆ A-1001-xxxxxxxxx │
│ B     ┆ 1002 ┆ B-1002-xxxxxxxxx │
│ C     ┆ 1003 ┆ C-1003-xxxxxxxxx │
│ D     ┆ 1004 ┆ D-1004-xxxxxxxxx │
│ E     ┆ 1005 ┆ E-1005-xxxxxxxxx │
└───────┴──────┴──────────────────┘

If you want to split your function result into multiple columns (assuming you can change the function):

  • Return a dictionary from the function, so the result ends up as a struct column in the DataFrame.
  • Use unnest() to split the resulting struct into separate columns.

def test2(auth, row):
    c = row["Group"]
    d = row["Val"]
    return {"c": c, "d": str(d), "auth": auth}

(
    df
    .with_columns(
        pl.struct(pl.all())
        .map_elements(lambda x: test2(auth_token, x))
        .alias("response")
    )
).unnest("response")

┌───────┬──────┬─────┬──────┬───────────┐
│ Group ┆ Val  ┆ c   ┆ d    ┆ auth      │
│ ---   ┆ ---  ┆ --- ┆ ---  ┆ ---       │
│ str   ┆ i64  ┆ str ┆ str  ┆ str       │
╞═══════╪══════╪═════╪══════╪═══════════╡
│ A     ┆ 1001 ┆ A   ┆ 1001 ┆ xxxxxxxxx │
│ B     ┆ 1002 ┆ B   ┆ 1002 ┆ xxxxxxxxx │
│ C     ┆ 1003 ┆ C   ┆ 1003 ┆ xxxxxxxxx │
│ D     ┆ 1004 ┆ D   ┆ 1004 ┆ xxxxxxxxx │
│ E     ┆ 1005 ┆ E   ┆ 1005 ┆ xxxxxxxxx │
└───────┴──────┴─────┴──────┴───────────┘

4 Comments

This worked perfectly. Thank you. I misunderstood the difference between map_batches and map_elements.
On a related note, are you aware of a way to break the response from the function into separate columns? For example, instead of combining the values into one string and returning that as the response, I'd like to return each value (c, d, and auth) into its own column in the dataframe. Thanks.
You can return a tuple or list from the function and then split the list into columns.
Do you have a short example of the syntax for doing that? I'd appreciate it a lot if so. Thanks!