I'm very new to pyspark and I'm attempting to transition my pandas code to pyspark. One thing I'm having trouble with is aggregating after a groupby.

Here is the pandas code:

df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])
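For reference, here is what that pandas line does on a small made-up dataset (the customer numbers and `trx` values below are hypothetical, just to show the shape of the output):

```python
import pandas as pd

# Toy stand-in for train1 (hypothetical values)
train1 = pd.DataFrame({
    'CUSTOMER_NUMBER': [1, 1, 2, 2],
    'trx': [1.0, 3.0, 2.0, 4.0],
})

# Per-customer mean and (sample) variance of trx
df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])
print(df_trx_m)
```

So the result is one row per customer with a `mean` and a `var` column, where `var` uses the n-1 divisor by default.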

I saw this example on AnalyticsVidhya but I'm not sure how to apply that to the code above:

train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
|  55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+

Any help would be much appreciated.

EDIT:

Here's another attempt:

from pyspark.sql.functions import avg, variance
train1.groupby("CUSTOMER_NUMBER")\
    .agg(
        avg('repatha_trx').alias("repatha_trx_avg"), 
        variance('repatha_trx').alias("repatha_trx_Var")
    )\
    .show(100)

But that is just giving me an empty dataframe.

1 Answer

You can use the aggregate functions in pyspark.sql.functions to perform the aggregation.

# load function
from pyspark.sql import functions as F

# aggregate data
df_trx_m = train1.groupby('CUSTOMER_NUMBER').agg(
    F.avg(F.col('repatha_trx')).alias('repatha_trx_avg'),
    F.variance(F.col('repatha_trx')).alias('repatha_trx_var')
)

Note that pyspark.sql.functions.variance() is an alias for pyspark.sql.functions.var_samp(), the unbiased sample variance (divisor n-1), which matches the default behavior of pandas' var. If you want the population variance (divisor n), use pyspark.sql.functions.var_pop() instead.
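To see the difference between the two definitions without spinning up Spark, the standard-library `statistics` module computes the same two quantities (`variance` corresponds to Spark's variance()/var_samp(), `pvariance` to var_pop()):

```python
import statistics

values = [1.0, 3.0]

# Sample (unbiased) variance, divisor n-1:
# ((1-2)^2 + (3-2)^2) / (2-1) = 2.0
sample_var = statistics.variance(values)

# Population variance, divisor n:
# ((1-2)^2 + (3-2)^2) / 2 = 1.0
population_var = statistics.pvariance(values)

print(sample_var)      # 2.0
print(population_var)  # 1.0
```

With only a couple of observations per group the two can differ substantially, so it is worth checking which one your pandas baseline used before comparing results.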
