I'm looking to clean up event data that happens to have "duplicate" rows for a given day. I want to remove rows for a day that have more than one status based on the context of the next day's status value. Currently, I am using BigQuery and multiple CTE steps with self joins to iterate through days with multiple events to eventually "true up" every day to have a single status value.
I have tried using recursive CTEs with self joins, various window functions, etc without much luck. BigQuery doesn't allow analytic functions in recursive CTEs, including GROUP BYs :(
See below for an example of 2 iterations:
# data has multiple instances of days with more than one status (* = duplicate)
| date | status |
|------------|----------|
| 2024-11-01 | active |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active |
# first iteration with removed rows (**)
| date | status |
|------------|----------|
| 2024-11-01 | active |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active |
# second iteration with removed rows (***)
| date | status |
|------------|----------|
| 2024-11-01 | active |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active |
# final desired set of deduplicated rows
| date | status |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active |
| 2024-11-04 | active |
| 2024-11-05 | active |
I can imagine having to iterate N-times given the size of the data. Is there a recursive approach to this problem in SQL? Thanks!