
My understanding of Spark windows is as follows:

current row (1) -> window rows (1 or more) -> aggregation func. -> output for the current row (1) 

where a single row can be included in multiple windows. An aggregation function f is applied with f.over(window), which restricts the window scope to a single function. For example, I cannot apply filter(), especially not a dynamic one, to just the window rows before aggregating with sum().over(window).
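For concreteness, a minimal sketch of the pattern I mean (the column names grp, id, and x are illustrative):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# a window over the current row and the two preceding rows within each group
window = Window.partitionBy("grp").orderBy("id").rowsBetween(-2, 0)

# one aggregation per pass; the window frame itself cannot be filtered first
df = df.withColumn("s", f.sum("x").over(window))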

To do custom processing of the window rows, I can:
a) write a UDF that receives the window rows as input
b) use collect_list() to gather the window rows into a list for each row and continue processing on these lists

Is there any other option to use multiple standard Spark functions on the same window rows?

1 Answer


The filter use case can be achieved by applying sum over a conditional expression; a sketch follows the snippet below. It's also possible to use multiple Spark functions over the same window rows. For example, the snippet below is valid:

# two different window functions evaluated over the same window spec
(df.withColumn("a", f.sum("x").over(window))
   .withColumn("b", f.first("x").over(window))
)
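For the filter use case, a minimal self-contained sketch of the conditional-sum approach (the column names grp, id, x and the x > 0 condition are illustrative):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

window = Window.partitionBy("grp").orderBy("id").rowsBetween(-2, 0)

# "filter then sum" without an actual filter: rows failing the condition
# map to null, and sum() ignores nulls, so only x > 0 values are aggregated
df = df.withColumn(
    "pos_sum",
    f.sum(f.when(f.col("x") > 0, f.col("x"))).over(window),
)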

If you are looking to apply custom functions, you can write a User Defined Aggregate Function (UDAF) using Scala or Java. If your only option is Python, then collect_list plus a UDF is the way to go.
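A minimal sketch of the collect_list-plus-UDF route (the trimmed-mean logic is just an illustrative stand-in for arbitrary per-window processing):

from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

window = Window.partitionBy("grp").orderBy("id").rowsBetween(-2, 0)

@f.udf(DoubleType())
def trimmed_mean(xs):
    # arbitrary Python over the window rows: drop min and max, then average
    xs = sorted(xs)
    trimmed = xs[1:-1] if len(xs) > 2 else xs
    return float(sum(trimmed)) / len(trimmed) if trimmed else None

df = (df.withColumn("rows", f.collect_list("x").over(window))
        .withColumn("tmean", trimmed_mean("rows")))

Note that this materializes the whole window frame for every row, so it is heavier than native window functions.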
