I have a pandas DataFrame like this:
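import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  # ensures sp.stats is loaded

n = 6000
my_data = pd.DataFrame({
    "Category": np.random.choice(["cat1", "cat2"], size=n),
    "val_1": np.random.randn(n),
    "val_2": np.arange(1, n + 1),  # 1, 2, ..., n
})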
I want to calculate the count of one column and the means of the other two, aggregating by Category. This is described in the pandas documentation as "Applying different functions to DataFrame columns", and I do it like this:
counts_and_means = my_data.groupby("Category").agg(
    {
        "Category": np.count_nonzero,
        "val_1": np.mean,
        "val_2": np.mean,
    }
)
I also want to calculate a t-test p-value for val_2, testing the hypothesis that the mean of val_2 is zero. If val_2 were the only column involved, I could simply do what the pandas documentation describes as "Applying multiple functions at once." However, I'm trying to handle both multiple columns AND multiple functions. I can explicitly name the output columns in the "multiple functions at once" case, but I can't figure out how to do it when multiple columns are involved as well. Right now, when I try to do this all in one agg(...) step, the val_2 p-value entry overwrites the val_2 mean entry, because both use the same key in the same dict. So I end up having to create a second DataFrame and join the two:
val_tests = (
    my_data.groupby("Category")
    # ttest_1samp returns (statistic, p-value); [1] picks the p-value
    .agg({"val_2": lambda arr: sp.stats.ttest_1samp(arr, popmean=0)[1]})
    .rename(columns={"val_2": "p_val_2"})
)
results = pd.merge(counts_and_means, val_tests, left_index=True, right_index=True)
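To make the overwriting concrete: as far as I can tell, it is ordinary Python dict behaviour rather than anything pandas does. A dict literal that repeats a key silently keeps only the last value. The sketch below uses placeholder strings instead of real functions ("p_value" is just a hypothetical label, not a pandas aggregation name):

```python
# Hypothetical aggregation spec with the key "val_2" written twice.
# The strings are placeholders standing in for the real functions.
spec = {
    "val_1": "mean",
    "val_2": "mean",
    "val_2": "p_value",  # same key again: replaces the "mean" entry above
}

print(spec)  # {'val_1': 'mean', 'val_2': 'p_value'}
```

So by the time agg(...) sees the dict, the val_2 mean entry is already gone.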
My question: is there some way to do this all in one agg(...) step, without having to create a second result DataFrame and merge it in?
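One direction that might work is sketched below. It assumes pandas 0.25 or later, where agg(...) accepts keyword arguments of the form output_name=(column, function), which the documentation calls "named aggregation". The output names (n_rows, mean_val_1, mean_val_2, p_val_2), the helper p_value_mean_zero, and the fixed random seed are my own choices for illustration. I count val_1 rather than Category, so the count is the number of non-null val_1 values per group.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
n = 6000
my_data = pd.DataFrame({
    "Category": rng.choice(["cat1", "cat2"], size=n),
    "val_1": rng.standard_normal(n),
    "val_2": np.arange(1, n + 1),
})


def p_value_mean_zero(arr):
    """Return the one-sample t-test p-value for H0: mean of arr is 0."""
    return stats.ttest_1samp(arr, popmean=0)[1]


# Each keyword is an output column; each value is (input column, function).
# The same input column (val_2) can appear under several output names.
results = my_data.groupby("Category").agg(
    n_rows=("val_1", "count"),
    mean_val_1=("val_1", "mean"),
    mean_val_2=("val_2", "mean"),
    p_val_2=("val_2", p_value_mean_zero),
)
print(results)
```

Because the output names are keyword arguments rather than dict keys tied to the input column, val_2 can feed both mean_val_2 and p_val_2 without one replacing the other.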
(See my other closely related agg question here.)