I have a dataframe ordered by date. Each row contains a "flag" column (values are 1 or 0). I want to find sequences of 3 or more consecutive rows where "flag" equals 1. The objective is to reset "flag" to 0 wherever the row is not part of such a sequence.
This is an example of the data:
| date | flag |
|---|---|
| 01-01-2022 | 1 |
| 02-01-2022 | 1 |
| 03-01-2022 | 1 |
| 04-01-2022 | 1 |
| 05-01-2022 | 0 |
| 06-01-2022 | 0 |
| 07-01-2022 | 1 |
| 08-01-2022 | 1 |
| 09-01-2022 | 0 |
| 10-01-2022 | 1 |
We keep the 1s only in the first four rows, since they form a run of four consecutive 1s (at least 3). All shorter runs (rows 7–8 and row 10) are reset to 0. The desired output is:
| date | flag |
|---|---|
| 01-01-2022 | 1 |
| 02-01-2022 | 1 |
| 03-01-2022 | 1 |
| 04-01-2022 | 1 |
| 05-01-2022 | 0 |
| 06-01-2022 | 0 |
| 07-01-2022 | 0 |
| 08-01-2022 | 0 |
| 09-01-2022 | 0 |
| 10-01-2022 | 0 |
I thought it might make sense to use the lag function to compare each row with the previous one, but I'm not sure how efficient that is in PySpark.