Recursive CTE to remove duplicates

Question

I'm looking to clean up event data that happens to have "duplicate" rows for a given day. I want to remove rows for a day that have more than one status based on the context of the next day's status value. Currently, I am using BigQuery and multiple CTE steps with self joins to iterate through days with multiple events to eventually "true up" every day to have a single status value.

I have tried using recursive CTEs with self joins, various window functions, etc without much luck. BigQuery doesn't allow analytic functions in recursive CTEs, including GROUP BYs :(

See below for an example of 2 iterations:

# data has multiple instances of days with more than one status (* = duplicate)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |*
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |*
| 2024-11-05 | active   |

# first iteration with removed rows (**)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |** (2024-11-02 is inactive, so remove this row)
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-03 | inactive |* (2024-11-04 has duplicates, so we can't derive yet)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |** (2024-11-05 is active, so remove this row)
| 2024-11-05 | active   |

# second iteration with removed rows (***)
| date       | status   |
|------------|----------|
| 2024-11-01 | active   |**
| 2024-11-01 | inactive |*
| 2024-11-02 | inactive |
| 2024-11-03 | active   |*
| 2024-11-03 | inactive |*** (2024-11-04 has been deduped to active, so remove this row)
| 2024-11-04 | active   |*
| 2024-11-04 | inactive |**
| 2024-11-05 | active   |

# final desired set of deduplicated rows
| date       | status   |
|------------|----------|
| 2024-11-01 | inactive |
| 2024-11-02 | inactive |
| 2024-11-03 | active   |
| 2024-11-04 | active   |
| 2024-11-05 | active   |

I can imagine having to iterate N-times given the size of the data. Is there a recursive approach to this problem in SQL? Thanks!

keithwalsh · Accepted Answer · 2024-11-26 00:22:46Z

2

CTE "a" sets status to NULL for dates with multiple statuses.
CTE "b" uses FIRST_VALUE to find next known status for dates with NULL status.

WITH a AS (
  SELECT date, IF(COUNT(DISTINCT status) = 1, MIN(status), NULL) AS status
  FROM sample_data
  GROUP BY date
),
b AS (
  SELECT
    date,
    COALESCE(
      status,
      FIRST_VALUE(status IGNORE NULLS) OVER (
        ORDER BY date
        ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
      )
    ) AS final_status
  FROM a
)
SELECT date, final_status AS status
FROM b
ORDER BY date;

Output:

date	status
2024-11-01	inactive
2024-11-02	inactive
2024-11-03	active
2024-11-04	active
2024-11-05	active

answered Nov 26, 2024 at 0:22

keithwalsh

9286 silver badges21 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

alpacafondue Over a year ago

Nice! I think this might work, I'll give it a try and report back. Any reason to not use ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING?

keithwalsh Over a year ago

Good spot. IGNORE NULLS allows to use ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING instead of ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING. Both will give the same result.

alpacafondue Over a year ago

Yep, this works! I think COALESCE can be omitted with ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.

Collectives™ on Stack Overflow

Recursive CTE to remove duplicates

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related