0

I am still struggling to get really familiar with the pandas groupby operations. When passing a function to agg, what if the aggregation function that is passed needs to consider values in columns other than those that are being aggregated.

Consider the following data frame example which is lists sales of products by two salesmen:

DateList = np.array( [(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days = x) for x in [1, 2, 2, 3, 4, 4, 5]] + \
[(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days = x) for x in [1, 1, 2, 3, 4, 5, 5]]
Names = np.array(['Joe' for x in xrange(7)] + ['John' for x in xrange(7)])
Product = np.array(['Product1', 'Product1', 'Product2', 'Product2', 'Product2', 'Product3', 'Product3', \
                    'Product1', 'Product2', 'Product2', 'Product2', 'Product2', 'Product2', 'Product3'])
Volume = np.array([100, 0, 150, 175, 15, 120, 150, 75, 0, 115, 130, 135, 10, 120])
Prices = {'Product1' : 25.99, 'Product2': 13.99, 'Product3': 8.99}
SalesDF = DataFrame({'Date' : DateLists, 'Seller' : Names, 'Product' : Product, 'Volume' : Volume})
SalesDF.sort(['Date', 'Seller'], inplace = True)
SalesDF['Prices'] = SalesDF.Product.map(Prices)

On some days each seller sells more than one item. Suppose you wanted to aggregate the data set into a single day/seller observations, and you wished to do so based upon which product sold the most volume. To be clear, this would be simple for the volume measure, simply pass a max function to agg. However for evaluating which Product and Price would remain would mean determining which volume measure was highest and then returning the value tha corresponds with that max value.

I am able to get the result I want by using the index values in the column that is passed to the function when agg is called and referencing the underlying data frame:

def AggFunc(x, df, col1):
        #Create list of index values that index the data in the column passed as x
    IndexVals = list(x.index)

        #Use those index values to create a list of the values of col1 in those index positions in the underlying data frame. 
    ColList = list(df[col1][IndexVals])

        # Find the max value of the list of values of col1
    MaxVal = np.max(ColList)

        # Find the index value of the max value of the list of values of col1
    MaxValIndex = ColList.index(MaxVal)

        #Return the data point in the list of data passed as column x which correspond to index value of the the max value of the list of col1 data
    return list(x)[MaxValIndex]

FunctionDict = {'Product': lambda x : AggFunc(x, SalesDF, 'Volume'), 'Volume' : 'max',\
'Prices': lambda x : AggFunc(x, SalesDF, 'Volume')}

SalesDF.groupby(['Date', "Seller"], as_index = False).agg(FunctionDict)

But I'm wondering if there is a better way where I can pass 'Volume' as an argument to the function that aggregates Product without having to get the index values and create lists from the data in the underlying dataframe? Something tells me no, as agg passes each column as a series to the aggregation function, rather than the dataframe itself.

Any ideas?

Thanks

1
  • Maybe this is not what agg was intended for. Maybe I should be using apply... Commented Nov 10, 2013 at 14:22

1 Answer 1

1

Maybe extracting the right indices first using .idxmax would be simpler?

>>> grouped = SalesDF.groupby(["Date", "Seller"])["Volume"]
>>> max_idx = grouped.apply(pd.Series.idxmax)
>>> SalesDF.loc[max_idx]
          Date   Product Seller  Volume  Prices
0   2013-11-04  Product1    Joe     100   25.99
7   2013-11-04  Product1   John      75   25.99
2   2013-11-05  Product2    Joe     150   13.99
9   2013-11-05  Product2   John     115   13.99
3   2013-11-06  Product2    Joe     175   13.99
10  2013-11-06  Product2   John     130   13.99
5   2013-11-07  Product3    Joe     120    8.99
11  2013-11-07  Product2   John     135   13.99
6   2013-11-08  Product3    Joe     150    8.99
13  2013-11-08  Product3   John     120    8.99

idxmax gives the index of the first occurrence of the maximum value. If you want to keep multiple products if they all obtain the maximum volume, it'd be a little different, something more like

>>> max_vols = SalesDF.groupby(["Date", "Seller"])["Volume"].transform(max)
>>> SalesDF[SalesDF.Volume == max_vols]
Sign up to request clarification or add additional context in comments.

3 Comments

This looks like an incredibly interesting solution, if you are able to supply some explanation that would be amazing, i'm not familiar with idxmax... I'll look it up as well. Thanks
Oh I see, thats pretty sweet actually! Is it possible that you meant to write max_idx in the final square bracket. Also, am I right that passing a column to the agg func that references a column that is not being aggreated is not possible? This is a neat workaround, but perhaps not entirely an answer to my problem
Yep, sorry-- changed variable names on you. I can't think of a good way to do what you want with agg, but people are always coming up with ways to do things I can't figure out, so that doesn't prove anything.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.