Error when using pandas assign function when returned value is a list

Question

I am wondering why pandas assign function cannot handle returned lists.

For example

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "val": [10, 20, 30, 30, 40]
})


def squareMe(x):
    return x**2

df = df.assign(val2 = lambda x: squareMe(x.val))

# Out > Works fine : Returns a DataFrame with squared values

But if we return a list,

def squareMe(x):
    return [x**2]

df = df.assign(val2 = lambda x: squareMe(x.val))

#Out > ValueError: Length of values (1) does not match length of index (5)

However pandas apply function works fine when returning a list

def squareMe(x):
    return [x**2]
df["val2"] = df.val.apply(lambda x: squareMe(x))

Any particular reason why this is or am I doing something wrong?

cs95 · Accepted Answer · 2021-10-07 09:17:10Z

Since you reference x.val in the call to squareMe, that function is passed a list (you can easily verify this by adding a debug statement to print type(x) inside the function).

Thus, x ** 2 returns a Series (since the expression is vectorized) and the assignment works correctly.

But when you return [x ** 2] you're returning the Series inside a list, which doesn't make sense to assign: all it sees is an iterable of size 1 (the Series inside it), and it deems this the wrong length for a column assignment to a DataFrame with 5 rows (which is exactly what ValueError: Length of values (1) does not match length of index (5) means).
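To make the length mismatch concrete, here is a minimal sketch that reuses the question's DataFrame. The names squared, wrapped and error_message are only illustrative, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

squared = df["val"] ** 2   # a Series with one value per row
wrapped = [squared]        # a plain Python list holding that single Series

print(len(squared), len(wrapped))

# Assigning the one-element list as a column fails the length check.
try:
    df.assign(val2=wrapped)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The list has one element no matter how long the Series inside it is, and that outer length is what gets compared with the index.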

The difference with apply is that the function receives a number, not a Series. So you still return a single item (a list), which apply accepts, but it is still technically wrong, since you shouldn't need to wrap the result in a list.

More information: df.assign, df.apply

P.S.: you probably already understand this, but you can simplify this to df['val2'] = df['val'] ** 2

U13-Forward · Accepted Answer · 2021-10-07 09:13:57Z

assign isn't really meant for this; it is meant for assigning columns from sequences that have already been computed and are passed as the arguments.

Docs:

Parameters: **kwargs : dict of {str: callable or Series}

The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
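As a rough illustration of the quoted parameter description, the sketch below passes a callable, a Series, a list and a scalar to assign. The column names from_callable, from_series, from_list and from_scalar are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

out = df.assign(
    from_callable=lambda d: d["val"] ** 2,  # called once with the whole DataFrame
    from_series=df["val"] ** 2,             # not callable: assigned as-is
    from_list=[1, 4, 9, 16, 25],            # list-like: length must match the index
    from_scalar=0,                          # scalar: repeated for every row
)

print(out)
print(df.columns.tolist())  # the original DataFrame is left unchanged
```

In each case the value that ends up in the column is either a scalar or something with one entry per row, which is why a one-element list is rejected for a five-row frame.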

Doing [x ** 2] returns a list containing a single Series, which is treated as a sequence of length 1, and therefore, as the error mentions:

ValueError: Length of values (1) does not match length of index (5)

The length of the values wouldn't match the length of the index.

Rhubarb · Accepted Answer · 2022-11-12 19:25:03Z

This has been driving me nuts all day in my own work, but I've got it now. cs95 is almost correct, but not quite. If you follow their advice and put a print(f"{type(x)}") in your squareMe function you'll see that it's a Series, not a list.

That's the catch: x.val is always a Series (the entire column of values), and squareMe returns a Series.

In contrast, DataFrame.apply with axis=1 iterates over each row of the DataFrame, so it passes each individual value of x.val to squareMe, building a new Series for your new column in the process.
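One way to check this for yourself is to record what each function actually receives. The sketch below assumes the question's DataFrame; record_assign, record_apply and the two lists are hypothetical helpers added only for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

seen_by_assign = []  # type names of the arguments seen via assign
seen_by_apply = []   # type names of the arguments seen via Series.apply

def record_assign(x):
    seen_by_assign.append(type(x).__name__)
    return x ** 2

def record_apply(x):
    seen_by_apply.append(type(x).__name__)
    return [x ** 2]

with_assign = df.assign(val2=lambda d: record_assign(d.val))
with_apply = df.val.apply(record_apply)

print(seen_by_assign)       # a single call that received the whole column
print(len(seen_by_apply))   # one call per value in the column
print(with_apply.tolist())  # each element is a one-item list
```

The function used with assign is called once with the whole column, while the one used with apply is called once per value, which is why wrapping the result in a list only breaks the first case.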

The reason it confused you (and me!) is that, when it works in your first example, it looks like squareMe is operating on integers and returning an integer for each row. But in fact, it's taking advantage of operator overloading to square the Series, not individual values: the ** operator calls the Series' pow implementation, which, like the other overloaded operators on Series, works element-wise.
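A small sketch of that operator overloading, using a standalone Series with the same values as the val column. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 30, 40], name="val")

via_operator = s ** 2       # the ** operator on a Series
via_method = s.pow(2)       # the explicit Series.pow method
via_dunder = s.__pow__(2)   # the special method that ** dispatches to

print(via_operator.tolist())
print(via_operator.equals(via_method), via_operator.equals(via_dunder))
```

All three spellings square each element and return a new Series of the same length.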

Now, when you change squareMe to return the list of the result: [x**2], it's again squaring the entire Series to get a new Series of squares, but then making a list of that Series. That is, a list of a single element, the element being a Series.

Now assign was expecting a Series back from squareMe of the same length as the index of the dataframe, which is 5, and you returned it a list with a single element - hence the error: expected length 5, got 1.

Your apply, in the meantime, is working on the Series val because that's what you called it on, and it's iterating over the values in that Series. Another way to do the apply, which is closer to your assign, is this:

df["val2"] = df.apply(lambda x: squareMe(x.val), axis=1)
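For comparison, here is a hedged sketch of the row-wise apply with and without the list wrapper. It assumes the question's DataFrame and the list-returning squareMe; as_lists and as_numbers are names introduced just for this example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

def squareMe(x):
    return [x ** 2]

# axis=1 calls the lambda once per row, so row.val is a single value here.
as_lists = df.apply(lambda row: squareMe(row.val), axis=1)

# Without the wrapper, each row yields a plain number.
as_numbers = df.apply(lambda row: row.val ** 2, axis=1)

print(as_lists.tolist())
print(as_numbers.tolist())
```

Both calls produce one result per row, so either can be stored as a column, but only the second gives a numeric column rather than a column of one-element lists.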
