Pandas: Aggregating and applying multiple functions to same column

Question

I have a pandas DataFrame like this:

n = 6000
my_data = DataFrame ({
    "Category"  : np.random.choice (['cat1','cat2'], size=n) ,
    "val_1"     : np.random.randn(n) ,
    "val_2"     : [i for i in range (1,n+1)]
})

I want to calculate the count of one column and the means of the other two, aggregating by Category. This is described in the pandas documentation as "Applying different functions to DataFrame columns", and I do it like this:

counts_and_means = \
    my_data.groupby("Category").agg (
        {
            "Category"  : np.count_nonzero ,
            "val_1"     : np.mean ,
            "val_2"     : np.mean
        }
    )

I also want to calculate a t-test p-variable for val_2, testing the hypothesis that the mean of val_2 is zero. If val_2 were the only column I was doing anything with throughout this whole process, I could just do what is described in the Pandas documentation as "Applying multiple functions at once." However, I'm trying to do both multiple columns AND multiple functions. I can explicitly name output columns when it's just the "multiple functions at once" case, but I can't figure out how to do it when there are also multiple columns involved. Right now when I try to do this all in one agg(...) step, the val_2 p-value column definition overwrites the original mean column definition, because they're both in the same dict. So, I end up needing to create a second DataFrame and joining them:

val_tests = \
    my_data.groupby("Category").agg (
        {
            "val_2"     : lambda arr : sp.stats.ttest_1samp(arr, popmean=0)[1]
        }
    ) \
    .rename (columns={"val_2" : "p_val_2"})

results = pd.merge(counts_and_means, val_tests, left_index=True, right_index=True)

My question: is there some way to do this all in one agg(...) step, without having to create a second result DataFrame and performing the merge?

(See my other closely-related agg question here.)

Community · Accepted Answer · 2017-05-23 12:06:34Z

2

You could try this. I'm passing a list of functions for val_2

def ttest(arr):     
    return stats.ttest_1samp(arr, popmean=0)[1]

counts_and_means = \
    my_data.groupby("Category").agg (
        {
            "Category"  : np.count_nonzero ,
            "val_1"     : np.mean ,
            "val_2"     : [np.mean,ttest]
        }
    )

if you prefer a slightly shorter version

counts_and_means = \
    my_data.groupby("Category").agg (
        {
            "Category"  : np.count_nonzero ,
            "val_1"     : np.mean ,
            "val_2"     : [np.mean,lambda arr : stats.ttest_1samp(arr, popmean=0)[1]]
        }
    )

This yields

     Category      val_1            val_2
         count_nonzero  mean        mean    ttest
Category                
cat1     3059           0.007861    2990.997712 0
cat2     2941          -0.007450    3010.383543 0

Per your comment below, you can "flatten" your columns

You can drop a level

counts_and_means.columns = counts_and_means.columns.droplevel()
counts_and_means

but that removes the top level leaving you duplicate column names

           count_nonzero    mean        mean       ttest
Category                
cat1       3059             0.007861    2990.997712 0
cat2       2941            -0.007450    3010.383543 0

This might be a better option. It uses a list comp to concat your column names which I got from here

counts_and_means.columns = ['%s%s' % (a, '_%s' % b if b else '') 
                           for a, b in counts_and_means.columns]
counts_and_means

            Category_count_nonzero  val_1_mean  val_2_mean   val_2_ttest
Category                
cat1        3059                    0.007861    2990.997712  0
cat2         2941                  -0.007450    3010.383543  0

edited May 23, 2017 at 12:06

CommunityBot

11 silver badge

answered Apr 19, 2015 at 1:53

Bob Haffner

8,5231 gold badge40 silver badges44 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

sparc_spread Over a year ago

This looks great, though I must confess I am new to MultiIndex, which is what the resulting DatafFrame has. Do you know a way of "flattening" the MultiIndex into a regular 1-dimensional index of columns?

Bob Haffner Over a year ago

Yeah, i think I know what you mean. I'll modify my answer

Collectives™ on Stack Overflow

Pandas: Aggregating and applying multiple functions to same column

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related