I'm trying to clean up event data that has "duplicate" rows for some days, meaning more than one status on the same date. For each such day, I want to keep only the row whose status matches the next day's status and remove the others. Currently, I am using BigQuery with multiple CTE steps and self joins to work through the days with multiple events, so that every day eventually ends up with a single status value.

I have tried recursive CTEs with self joins, various window functions, etc., without much luck. BigQuery doesn't allow analytic functions or GROUP BY in recursive CTEs :(

See below for an example of 2 iterations:

# data has multiple instances of days with more than one status (* = duplicate)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active   |

# first iteration with removed rows (**)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active   |

# second iteration with removed rows (***)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active   |

# final desired set of deduplicated rows
| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |

I can imagine having to iterate N times, depending on how many consecutive days have duplicates. Is there a recursive approach to this problem in SQL? Thanks!

1 Answer

  • CTE "a" collapses each date to a single row and sets status to NULL for dates with more than one distinct status.
  • CTE "b" uses FIRST_VALUE with IGNORE NULLS to find the next known status for dates with a NULL status.
-- Step 1: one row per date; NULL marks dates with conflicting statuses.
WITH a AS (
  SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
  FROM sample_data
  GROUP BY date
),
-- Step 2: fill each NULL with the first non-NULL status on a later date.
b AS (
  SELECT
    date,
    COALESCE(
      status,
      FIRST_VALUE(status IGNORE NULLS) OVER (
        ORDER BY date
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      )
    ) AS final_status
  FROM a
)
SELECT date, final_status AS status
FROM b
ORDER BY date;

Output:

| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |
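
To try this without an existing table, the rows from the question can be inlined as a CTE. The following is a minimal sketch, assuming the name sample_data (the table the query above reads from) and the eight rows shown in the question. It runs only the first step, so the NULL placeholders that the second step fills are visible.

```sql
-- Sample rows copied from the question, inlined so no real table is needed.
-- The CTE name sample_data is an assumption matching the answer's query.
WITH sample_data AS (
  SELECT DATE '2024-11-01' AS date, 'active' AS status UNION ALL
  SELECT DATE '2024-11-01', 'inactive' UNION ALL
  SELECT DATE '2024-11-02', 'inactive' UNION ALL
  SELECT DATE '2024-11-03', 'active' UNION ALL
  SELECT DATE '2024-11-03', 'inactive' UNION ALL
  SELECT DATE '2024-11-04', 'active' UNION ALL
  SELECT DATE '2024-11-04', 'inactive' UNION ALL
  SELECT DATE '2024-11-05', 'active'
)
-- Same aggregation as CTE "a": one row per date,
-- with NULL where a date has more than one distinct status.
SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
FROM sample_data
GROUP BY date
ORDER BY date;
```

For this data, 2024-11-01, 2024-11-03 and 2024-11-04 come back with a NULL status, while 2024-11-02 (inactive) and 2024-11-05 (active) keep their single value. Because the window then looks forward to the first non-NULL status, a run of consecutive duplicate days is resolved in one pass rather than one iteration per day.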

3 Comments

Nice! I think this might work, I'll give it a try and report back. Any reason not to use ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING?
Good spot. With IGNORE NULLS, you can use ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING instead of ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING. Both give the same result.
Yep, this works! I think COALESCE can be omitted with ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.
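
Putting the comments together, the simplified variant might look like the sketch below. It assumes the same sample rows, inlined here under the assumed name sample_data. A frame starting at CURRENT ROW already includes the row's own status, and IGNORE NULLS skips it when it is NULL, so neither COALESCE nor the second CTE is needed.

```sql
-- Sample rows copied from the question; the CTE name sample_data is an assumption.
WITH sample_data AS (
  SELECT DATE '2024-11-01' AS date, 'active' AS status UNION ALL
  SELECT DATE '2024-11-01', 'inactive' UNION ALL
  SELECT DATE '2024-11-02', 'inactive' UNION ALL
  SELECT DATE '2024-11-03', 'active' UNION ALL
  SELECT DATE '2024-11-03', 'inactive' UNION ALL
  SELECT DATE '2024-11-04', 'active' UNION ALL
  SELECT DATE '2024-11-04', 'inactive' UNION ALL
  SELECT DATE '2024-11-05', 'active'
),
-- One row per date; NULL marks dates with conflicting statuses.
a AS (
  SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
  FROM sample_data
  GROUP BY date
)
SELECT
  date,
  -- The first non-NULL status from this date onward: the date's own
  -- status if it is unambiguous, otherwise the next known one.
  FIRST_VALUE(status IGNORE NULLS) OVER (
    ORDER BY date
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  ) AS status
FROM a
ORDER BY date;
```

One caveat applies to both variants: if the latest date in the data has conflicting statuses, there is no later value to borrow, so its status stays NULL and would need separate handling.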
