PySpark loop in groupBy aggregate function

Question

I have a big table for which I'm trying to calculate conditional sums of some columns, grouped by a location.

My code looks like this, and the number of columns keeps growing:

df.groupBy(location_column).agg(
        F.sum(F.when(F.col(col1) == True, F.col(value))).alias("SUM " + col1),
        F.sum(F.when(F.col(col2) == True, F.col(value))).alias("SUM " + col2),
        F.sum(F.when(F.col(col3) == True, F.col(value))).alias("SUM " + col3),
        ....
        # Additional lines for additional columns (around 20)
)

I want to refactor my code so it is less repetitive, by doing something like this:

cols = [col1, col2, col3, ... , coln]
df.groupBy(location_column).agg([F.sum(F.when(F.col(x) == True, F.col(value))).alias("SUM " + x)] for x in cols)

It's not working because the agg() function does not accept a list:

assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"

Is there a way to refactor it? Thanks

mck · Accepted Answer · 2021-03-18 13:15:39Z

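The for x in cols clause should be inside the square brackets, so that the brackets form a single list comprehension. You also need to put a * before the list comprehension to unpack it, because agg() expects each Column as a separate positional argument rather than one list: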

df.groupBy(location_column).agg(
    *[F.sum(F.when(F.col(x) == True, F.col(value))).alias("SUM " + x) for x in cols]
)
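
To see why the * matters without needing a Spark session, here is a minimal plain-Python sketch. The Column class and the agg function below are hypothetical stand-ins, not the real PySpark objects; they only mimic a variadic signature of the form agg(*exprs) and the assertion quoted in the question. The column names are made up for illustration.

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, name):
        self.name = name


def agg(*exprs):
    """Stand-in that mimics a variadic agg(*exprs) and its type check."""
    assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    return [c.name for c in exprs]


cols = ["a", "b", "c"]  # assumed column names

# Without *, the whole list arrives as ONE argument, so the check fails.
try:
    agg([Column("SUM " + x) for x in cols])
    failed = False
except AssertionError:
    failed = True

# With *, each element of the list becomes its own positional argument.
names = agg(*[Column("SUM " + x) for x in cols])

print(failed)
print(names)
```

In the stand-in, the first call fails the same kind of check as in the question because exprs holds a single list, while the second call passes because exprs holds three Column objects.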
