
I would like to group my Spark DataFrame and aggregate with a custom function:

def gini(list_of_values):
    # some processing happens here
    return number_output


I would like to do something like this:

df.groupby('activity')['mean_event_duration_in_hours'].agg(gini)

Could you please help me solve this?

1 Answer


You can create a udf like so:

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

def gini(list_of_values):
    # some processing happens here; it must return a Python float to match FloatType
    return number_output

udf_gini = F.udf(gini, FloatType())

df.groupby('activity')\
    .agg(F.collect_list("mean_event_duration_in_hours").alias("event_duration_list"))\
    .withColumn("gini", udf_gini(F.col("event_duration_list")))

Or define gini as a UDF like this:

from pyspark.sql.functions import udf

@udf(returnType=FloatType())
def gini(list_of_values):
    # some processing happens here
    return number_output

df.groupby('activity')\
    .agg(F.collect_list("mean_event_duration_in_hours").alias("event_duration_list"))\
    .withColumn("gini", gini(F.col("event_duration_list")))
