
My understanding of Spark windows is as follows:

current row (1) -> window rows (1 or more) -> aggregation func. -> output for the current row (1) 

where a single row can be included in multiple windows. An aggregation function f is applied with f.over(window), which restricts the window scope to a single function. For example, I cannot apply filter(), especially not a dynamic one, to just the window rows before aggregating with sum().over(window).
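For concreteness, a minimal sketch of the pattern I mean (the column names grp, id, and x are illustrative):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# a window over the current row and the two preceding rows within each group
window = Window.partitionBy("grp").orderBy("id").rowsBetween(-2, 0)

# one aggregation per pass; the window frame itself cannot be filtered first
df = df.withColumn("s", f.sum("x").over(window))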

To do custom processing of the window rows, I can:
a) write a UDF that receives the window rows as input
b) use collect_list() to gather the window rows into a list for each row and continue processing on these lists

Is there any other option to use multiple standard Spark functions on the same window rows?

1 Answer


The filter use case can be achieved by applying sum over a conditional expression; a sketch follows the snippet below. It's also possible to use multiple Spark functions over the same window rows. For example, the snippet below is valid:

# two different window functions evaluated over the same window spec
(df.withColumn("a", f.sum("x").over(window))
   .withColumn("b", f.first("x").over(window))
)
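For the filter use case, a minimal self-contained sketch of the conditional-sum approach (the column names grp, id, x and the x > 0 condition are illustrative):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

window = Window.partitionBy("grp").orderBy("id").rowsBetween(-2, 0)

# "filter then sum" without an actual filter: rows failing the condition
# map to null, and sum() ignores nulls, so only x > 0 values are aggregated
df = df.withColumn(
    "pos_sum",
    f.sum(f.when(f.col("x") > 0, f.col("x"))).over(window),
)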

If you are looking to apply custom functions, you can write a User Defined Aggregate Function (UDAF) using Scala or Java. If your only option is Python, then collect_list plus a UDF is the way to go.
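A minimal sketch of the collect_list-plus-UDF route (the trimmed-mean logic is just an illustrative stand-in for arbitrary per-window processing):

from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

window = Window.partitionBy("grp").orderBy("id").rowsBetween(-2, 0)

@f.udf(DoubleType())
def trimmed_mean(xs):
    # arbitrary Python over the window rows: drop min and max, then average
    xs = sorted(xs)
    trimmed = xs[1:-1] if len(xs) > 2 else xs
    return float(sum(trimmed)) / len(trimmed) if trimmed else None

df = (df.withColumn("rows", f.collect_list("x").over(window))
        .withColumn("tmean", trimmed_mean("rows")))

Note that this materializes the whole window frame for every row, so it is heavier than native window functions.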
