Delete duplicate equal rows in BigQuery

Question

There is a table with duplicates rows, where all column values are equal:

+------+---------+------------+
|  id  |  value  | timestamp  |
+------+---------+------------+
| 1    |  500    | 2019-10-12 |
| 2    |  400    | 2019-10-11 |
| 1    |  500    | 2019-10-12 |
+------+---------+------------+

I want to keep one of those equal rows and delete the others. I came up with:

DELETE
FROM
  `table` t1
WHERE (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY id),
  FROM
    `table` t2
  WHERE
    t1.id = t2.id 
) > 1

However this does not work:

Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

Any ideas how to remove duplicate rows?

Mikhail Berlyant · Accepted Answer · 2020-04-28 16:56:11Z

1

Below is for BigQuery Standard SQL

... where all column values are equal
So you can use simple DISTINCT * and instead of DELETE use CREATE / REPLACE to write back to the same table

#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`   
PARTITION BY date
SELECT DISTINCT * 
FROM `project.dataset.table`

In PARTITION BY clause - you should add the fields you use to partition the original table

edited Apr 28, 2020 at 16:56

answered Apr 28, 2020 at 13:52

Mikhail Berlyant

174k10 gold badges173 silver badges251 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Delete duplicate equal rows in BigQuery

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related