
I have a simple PySpark function:

from pyspark.sql.functions import collect_list

features = ['x', 'y', 'z']

def f(features):
    df.groupBy('id').agg(collect_list(features[0]), collect_list(features[1]), ...)

I want it so that if someone passes in features=['x', 'y', 'z', 'a'], each element of features gets its own collect_list call in the aggregation. How can I do this? They all have to be in the same agg call.


1 Answer

from pyspark.sql import functions as F

features = ['x', 'y', 'z']

def f(features):
    # Build one collect_list expression per feature, then unpack the
    # list with * so they all land in a single agg call
    return df.groupBy("id").agg(*[F.collect_list(feature) for feature in features])

The list comprehension builds one collect_list expression per element of features, and the * operator unpacks them as separate arguments to a single agg call, so one aggregated column is created for each feature.
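Here is a minimal, self-contained sketch of the pattern; the SparkSession setup and the sample data are illustrative assumptions, not part of the original question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data, assumed for illustration
    df = spark.createDataFrame(
        [(1, 10, 100), (1, 20, 200), (2, 30, 300)],
        ["id", "x", "y"],
    )

    features = ["x", "y"]

    # One collect_list column per feature, all in the same agg call
    result = df.groupBy("id").agg(*[F.collect_list(feature) for feature in features])
    result.show()
    # Default column names are collect_list(x), collect_list(y);
    # note that collect_list does not guarantee element order.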

To give the aggregated columns custom names, alias each expression:

df.groupBy("id").agg(*[F.collect_list(feature).alias(f"{feature}_list") for feature in features])
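With features = ['x', 'y', 'z'], this yields columns named x_list, y_list, and z_list instead of Spark's default collect_list(x)-style names.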



1 Comment

Thanks for answering! Try to edit your answer to include details about the code.
