
I am wondering why the pandas assign function cannot handle returned lists.

For example:

import pandas as pd

df = pd.DataFrame({
    "id" : [1,2,3,4,5],
    "val" : [10,20,30,30,40]
})


def squareMe(x):
    return x**2

df = df.assign(val2 = lambda x: squareMe(x.val))

# Out > Works fine : Returns a DataFrame with squared values

But if we return a list:

def squareMe(x):
    return [x**2]

df = df.assign(val2 = lambda x: squareMe(x.val))

#Out > ValueError: Length of values (1) does not match length of index (5)

However, the pandas apply function works fine when returning a list:

def squareMe(x):
    return [x**2]
df["val2"] = df.val.apply(lambda x: squareMe(x))

Is there a particular reason for this, or am I doing something wrong?

3 Answers

1

Since you reference x.val in the call to squareMe, that function is passed a list (you can easily verify this by adding a debug statement to print type(x) inside the function).

Thus, x ** 2 returns a Series (since the expression is vectorized) and the assignment works correctly.

But when you return [x ** 2] you're returning the Series inside a list, which doesn't make sense to assign: all it sees is an iterable of length 1 (the Series inside it), and it deems this to be the incorrect length for a column assignment to a DataFrame with 5 rows (which is exactly what ValueError: Length of values (1) does not match length of index (5) means).

The difference with apply is that the function receives a number, not a Series. So you still return a single item (a list) for each value, which apply accepts, but it is still not what you want, since you shouldn't need to wrap the result in a list.
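A minimal sketch of how one might check what each call actually hands to the function. The helper `record` and the two log lists are hypothetical names introduced only for this illustration; the DataFrame is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

seen_by_assign = []  # type names observed when called through assign
seen_by_apply = []   # type names observed when called through Series.apply

def record(x, log):
    # Hypothetical helper: remember the type of the argument, then square it.
    log.append(type(x).__name__)
    return x ** 2

out = df.assign(val2=lambda d: record(d.val, seen_by_assign))
applied = df.val.apply(lambda v: record(v, seen_by_apply))

print(seen_by_assign)      # ['Series'] : one call, with the whole column
print(len(seen_by_apply))  # 5 : one call per element
```

The callable given to assign runs once on the whole column, while the function given to Series.apply runs once per value.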

More information: df.assign, df.apply

P.S.: you probably already understand this, but you can simplify this to df['val2'] = df['val'] ** 2


0

assign isn't really meant for this; it is for assigning columns from values that are already column-length sequences (or from callables that return them), passed as keyword arguments.

Docs:

Parameters: **kwargs : dict of {str: callable or Series}

The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
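As a rough illustration of the quoted parameter description, the sketch below passes a callable, a Series, and a scalar to assign in one call. The column names `val3` and `flag` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

out = df.assign(
    val2=lambda d: d.val ** 2,  # callable: computed on the DataFrame
    val3=df.val * 2,            # Series: assigned as-is (aligned on the index)
    flag=0,                     # scalar: broadcast to every row
)

print(out.columns.tolist())  # ['id', 'val', 'val2', 'val3', 'flag']
print(df.columns.tolist())   # ['id', 'val'] : assign returns a new DataFrame
```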

Doing [x ** 2] returns a list containing a single Series, which assign treats as a sequence of length 1, and therefore, as the error mentions:

ValueError: Length of values (1) does not match length of index (5)

The length of the values doesn't match the length of the index.


0

This has been driving me nuts all day in my own work, but I've got it now. cs95 is almost correct, but not quite. If you follow their advice and put a print(f"{type(x)}") in your squareMe function you'll see that it's a Series, not a list.

That's the catch: x.val is always a Series (the entire column of values), and in your first example squareMe returns a Series.

In contrast, DataFrame.apply, if you specify axis=1, iterates over the rows of the DataFrame, so x.val is a single value in each call; each one is passed to squareMe, and a new Series is built for your new column in the process.

The reason it confused you (and me!) is that, when it works in your first example, it looks like squareMe is operating on integers and returning an integer for each row. But in fact, it's taking advantage of operator overloading to square the Series, not individual values: the ** operator dispatches to the Series' own power implementation (equivalent to Series.pow), which, like the other overloaded operators on Series, works element-wise.
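To make the operator-overloading point concrete, here is a small sketch comparing the ** operator with the Series.pow method on the same values as the question's val column. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 30, 40], name="val")

via_operator = s ** 2  # uses the overloaded ** on the whole Series
via_method = s.pow(2)  # the equivalent method call

print(type(via_operator).__name__)      # Series
print(via_operator.tolist())            # [100, 400, 900, 900, 1600]
print(via_operator.equals(via_method))  # True
```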

Now, when you change squareMe to return the list of the result: [x**2], it's again squaring the entire Series to get a new Series of squares, but then making a list of that Series. That is, a list of a single element, the element being a Series.

Now assign was expecting a Series back from squareMe of the same length as the index of the dataframe, which is 5, and you returned it a list with a single element - hence the error: expected length 5, got 1.
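The length mismatch can be inspected directly. This sketch builds the same wrapped value by hand and then catches the exception from assign; the exact wording of the message may differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": [10, 20, 30, 30, 40]})

wrapped = [df.val ** 2]  # a plain list whose only element is a Series
print(len(wrapped))                # 1
print(type(wrapped[0]).__name__)   # Series
print(len(wrapped[0]))             # 5

try:
    df.assign(val2=lambda d: [d.val ** 2])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Returning d.val ** 2 without the surrounding list gives assign a value of length 5, which matches the index.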

Your apply, in the meantime, is working on the Series val because that's what you called it on, and it's iterating over the values in that Series. Another way to do the apply, which is closer to your assign, is this:

df["val2"] = df.apply(lambda x: squareMe(x.val), axis=1)
