I'm very new to pyspark and I'm attempting to transition my pandas code to pyspark. One thing I'm having trouble with is aggregating after a groupby.

Here is the pandas code:

df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])
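For reference, here is what that pandas line does on a small made-up dataset (the customer numbers and `trx` values below are hypothetical, just to show the shape of the output):

```python
import pandas as pd

# Toy stand-in for train1 (hypothetical values)
train1 = pd.DataFrame({
    'CUSTOMER_NUMBER': [1, 1, 2, 2],
    'trx': [1.0, 3.0, 2.0, 4.0],
})

# Per-customer mean and (sample) variance of trx
df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])
print(df_trx_m)
```

So the result is one row per customer with a `mean` and a `var` column, where `var` uses the n-1 divisor by default.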

I saw this example on AnalyticsVidhya but I'm not sure how to apply that to the code above:

train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
|  55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+

Any help would be much appreciated.

EDIT:

Here's another attempt:

from pyspark.sql.functions import avg, variance
train1.groupby("CUSTOMER_NUMBER")\
    .agg(
        avg('repatha_trx').alias("repatha_trx_avg"), 
        variance('repatha_trx').alias("repatha_trx_Var")
    )\
    .show(100)

But that is just giving me an empty dataframe.

1 Answer

You can use the aggregate functions in pyspark.sql.functions to perform the aggregation.

# load function
from pyspark.sql import functions as F

# aggregate data
df_trx_m = train1.groupby('CUSTOMER_NUMBER').agg(
    F.avg(F.col('repatha_trx')).alias('repatha_trx_avg'),
    F.variance(F.col('repatha_trx')).alias('repatha_trx_var')
)

Note that pyspark.sql.functions.variance() is an alias for pyspark.sql.functions.var_samp(), the unbiased sample variance (divisor n-1), which matches the default behavior of pandas' var. If you want the population variance (divisor n), use pyspark.sql.functions.var_pop() instead.
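To see the difference between the two definitions without spinning up Spark, the standard-library `statistics` module computes the same two quantities (`variance` corresponds to Spark's variance()/var_samp(), `pvariance` to var_pop()):

```python
import statistics

values = [1.0, 3.0]

# Sample (unbiased) variance, divisor n-1:
# ((1-2)^2 + (3-2)^2) / (2-1) = 2.0
sample_var = statistics.variance(values)

# Population variance, divisor n:
# ((1-2)^2 + (3-2)^2) / 2 = 1.0
population_var = statistics.pvariance(values)

print(sample_var)      # 2.0
print(population_var)  # 1.0
```

With only a couple of observations per group the two can differ substantially, so it is worth checking which one your pandas baseline used before comparing results.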
