
I have been struggling with a custom aggregate function in pandas and have not been able to figure it out. Consider the following data frame:

import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})

Now, if I want to calculate the average of the value column using agg in pandas, it would be:

df.agg({'value': 'mean'})

which results in a single value of 2.5.

However, if I define the following custom mean function:

def my_mean(vec):
    return np.mean(vec)

and use it in the following code:

df.agg({'value': my_mean})

I get a per-row result instead of the single value 2.5.

So the question is: what should I do to get the same result as the default mean aggregate function? One more thing to note: if I call the mean method inside the custom function (shown below), it works just fine. However, I would like to know how to use np.mean in my custom function. Any help would be much appreciated!

def my_mean2(vec):
    return vec.mean()

1 Answer

When you pass a callable as the aggregate function and that callable is not one of the predefined callables such as np.mean or np.sum, pandas treats it as a transform and behaves like df.apply().
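To make the two calling conventions concrete, here is a small sketch that assumes the same df and my_mean as in the question. It applies the function element by element with Series.apply, then calls it once on the whole column. The names per_row and whole_column are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})

def my_mean(vec):
    return np.mean(vec)

# Element-wise: my_mean receives one scalar at a time,
# so the "mean" of each scalar is just that scalar.
per_row = df['value'].apply(my_mean)

# Whole column: my_mean receives the Series once and reduces it.
whole_column = my_mean(df['value'])

print(per_row.tolist())  # one value per row
print(whole_column)      # a single number
```

The first call is the kind of per-row result the question describes; the second is what an aggregation is expected to return.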

The way around it is to let pandas know that your callable expects a vector of values. A crude way to do this is something like:

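def my_mean(vals):
    print(type(vals))
    # A plain int has no .shape, so reject it with a TypeError;
    # pandas then retries with the whole column as a Series.
    try:
        vals.shape
    except AttributeError:
        raise TypeError("expected a vector of values")

    return np.mean(vals)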

>>> df.agg({'value': my_mean})
<class 'int'>
<class 'pandas.core.series.Series'> 
value    2.5
dtype: float64

As you can see, pandas first tries to call the function on each row (as df.apply would), but my_mean raises a TypeError, so on the second attempt pandas passes the whole column as a Series object. Comment out the try...except part and you'll see that my_mean is called on each row with an int argument.
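A slightly more explicit variant of the same trick, offered only as a sketch: instead of probing for .shape, check the argument type directly. It assumes a pandas version in which a TypeError from the per-row attempt triggers the whole-column retry described above. If a given version passes the column directly, the guard simply never fires. The name my_mean_vec is hypothetical.

```python
import numpy as np
import pandas as pd

def my_mean_vec(vals):
    # Reject anything that is not a vector, so a per-row call fails
    # with TypeError and the whole column is passed instead.
    if not isinstance(vals, (pd.Series, np.ndarray)):
        raise TypeError("my_mean_vec expects a vector of values")
    return np.mean(vals)

df = pd.DataFrame({'value': np.arange(1, 5), 'weights': np.arange(1, 5)})
result = df.agg({'value': my_mean_vec})
print(result)
```

The isinstance check states the intent more directly than an attribute probe, at the cost of listing the accepted types by hand.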


More on the first part:

my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

df.agg({'value': my_mean1})
df.agg({'value': my_mean2})

Although my_mean2 and np.mean are essentially the same, the expression my_mean2 is np.mean evaluates to False, so my_mean2 goes down the df.apply route, while my_mean1 works as expected.
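The identity check is easy to verify in isolation. The sketch below does not touch pandas at all (how pandas recognises known callables internally is not shown here). It only demonstrates that the alias and the wrapper compute the same thing while being different objects. The array values is just for illustration.

```python
import numpy as np

my_mean1 = np.mean
my_mean2 = lambda *args, **kwargs: np.mean(*args, **kwargs)

print(my_mean1 is np.mean)  # True: the very same function object
print(my_mean2 is np.mean)  # False: a new function that merely calls np.mean

values = np.arange(1, 5)
same_result = my_mean1(values) == my_mean2(values)
print(same_result)
```

Any lookup keyed on the function object itself will therefore match my_mean1 but not my_mean2, even though both return the same number.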

3 Comments

Awesome! Thanks so much for such a complete and in-depth explanation!
@Ashkan No worries! Would you mind accepting the answer if it solved your problem? Thanks
Just did! Thanks again:)
