
I have a simple PySpark function:

from pyspark.sql.functions import collect_list

features = ['x', 'y', 'z']

def f(features):
    df.groupBy('id').agg(collect_list(features[0]), collect_list(features[1]), ...)

I want it so that if someone passes in features=['x', 'y', 'z', 'a'], each element of features gets its own collect_list call in the aggregation. How can I do this? They all have to be in the same agg call.


1 Answer

from pyspark.sql import functions as F

features = ['x', 'y', 'z']

def f(features):
    # Build one collect_list expression per feature, then unpack the
    # list with * so they all land in a single agg call
    return df.groupBy("id").agg(*[F.collect_list(feature) for feature in features])

The list comprehension builds one collect_list expression per element of features, and the * operator unpacks them as separate arguments to a single agg call, so one aggregated column is created for each feature.
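Here is a minimal, self-contained sketch of the pattern; the SparkSession setup and the sample data are illustrative assumptions, not part of the original question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data, assumed for illustration
    df = spark.createDataFrame(
        [(1, 10, 100), (1, 20, 200), (2, 30, 300)],
        ["id", "x", "y"],
    )

    features = ["x", "y"]

    # One collect_list column per feature, all in the same agg call
    result = df.groupBy("id").agg(*[F.collect_list(feature) for feature in features])
    result.show()
    # Default column names are collect_list(x), collect_list(y);
    # note that collect_list does not guarantee element order.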

To give the aggregated columns custom names, alias each expression:

df.groupBy("id").agg(*[F.collect_list(feature).alias(f"{feature}_list") for feature in features])
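With features = ['x', 'y', 'z'], this yields columns named x_list, y_list, and z_list instead of Spark's default collect_list(x)-style names.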



1 Comment

Thanks for answering! Try to edit your answer to include details about the code.
