
I would like to apply a groupBy and a subsequent agg function to a PySpark DataFrame, but only to a specific window. This is best illustrated by an example. Suppose that I have a dataset named df:

df.show()

    +-----+----------+----------+-------+
    |   ID| Timestamp| Condition|  Value|
    +-----+----------+----------+-------+
    |   z1|         1|         0|     50|
|-------------------------------------------|
|   |   z1|         2|         0|     51|   |
|   |   z1|         3|         0|     52|   |
|   |   z1|         4|         0|     51|   |
|   |   z1|         5|         1|     51|   |
|   |   z1|         6|         0|     49|   |
|   |   z1|         7|         0|     44|   |
|   |   z1|         8|         0|     46|   |
|-------------------------------------------|
    |   z1|         9|         0|     48|
    |   z1|        10|         0|     42|
    +-----+----------+----------+-------+

In particular, what I would like to do is apply a window of +-3 rows around the row where column Condition == 1 (i.e. in this case, the row with Timestamp 5). Within that window, as depicted in the above DataFrame, I would like to find the minimum value of column Value and the corresponding value of column Timestamp, thus obtaining:

+----------+----------+
| Min_value| Timestamp|
+----------+----------+
|        44|         7|
+----------+----------+

Does anyone know how this can be tackled?

Many thanks in advance

Marioanzas

1 Answer


You can use a window that spans between 3 preceding and 3 following rows, get the minimum, and filter the condition:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'min',
    # min over a (Value, Timestamp) struct picks the smallest Value and
    # carries along the Timestamp that accompanies it
    F.min(
        F.struct('Value', 'Timestamp')
    ).over(Window.partitionBy('ID').orderBy('Timestamp').rowsBetween(-3, 3))
).filter('Condition = 1').select('min.*')

df2.show()
+-----+---------+
|Value|Timestamp|
+-----+---------+
|   44|        7|
+-----+---------+

2 Comments

Hi @mck! Many thanks for your suggestion. This would definitely work for datasets where there is only one row containing Condition == 1. However, I do not think it would work for datasets with two or more rows where Condition == 1. Do you agree?
@Marioanzas yes, I realised that was a bad solution. I rewrote it, and it should hopefully do the job much better now.
