
I would like to know, in Postgres, how to delete all duplicate records but one by sorting by a column.

Let's say I have the following table foo:

 id                 | name |  region   |          created_at
--------------------+------+-----------+-------------------------------
                  1 | foo  | sydney    | 2018-05-24 15:40:32.593745+10
                  2 | foo  | melbourne | 2018-05-24 17:28:59.452225+10
                  3 | foo  | sydney    | 2018-05-29 22:17:02.927263+10
                  4 | foo  | sydney    | 2018-06-13 16:44:32.703174+10
                  5 | foo  | sydney    | 2018-06-13 16:45:01.324273+10
                  6 | foo  | sydney    | 2018-06-13 17:04:49.487767+10
                  7 | foo  | sydney    | 2018-06-13 17:05:13.592844+10

I would like to delete all rows that share the same (name, region) pair, keeping only the row with the greatest created_at in each group. The result would be:

 id                 | name |  region   |          created_at
--------------------+------+-----------+-------------------------------
                  2 | foo  | melbourne | 2018-05-24 17:28:59.452225+10
                  7 | foo  | sydney    | 2018-06-13 17:05:13.592844+10

But I do not know where to begin. Any ideas?

2 Answers


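Use a subquery with ROW_NUMBER() and PARTITION BY to number the rows within each region from newest to oldest, then select every row numbered above 1, so that the most recent row in each region is kept. Give the derived table an alias (AS a below): PostgreSQL versions before 16 raise a syntax error for a subquery in FROM that has no alias.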

SELECT * 
FROM foo 
WHERE id IN (
  SELECT a.id 
  FROM (
    SELECT id, ROW_NUMBER() OVER (
        PARTITION BY region 
        ORDER BY created_at DESC
    ) row_no 
    FROM foo
  ) AS a 
  WHERE row_no > 1
);

... returns the rows to be deleted. Replace SELECT * with DELETE when you're satisfied with the result to delete the rows.
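
As a rough sketch of that final step, the DELETE can be run inside a transaction so the removed rows can be inspected before committing. This version makes two assumptions that are not in the query above: it partitions by (name, region), as the question asks, and it adds id as a tie-breaker in case two rows share the same created_at.

```sql
BEGIN;

DELETE FROM foo
WHERE id IN (
  SELECT a.id
  FROM (
    SELECT id, ROW_NUMBER() OVER (
        PARTITION BY name, region           -- assumption: duplicates are defined by (name, region)
        ORDER BY created_at DESC, id DESC   -- assumption: id breaks ties on created_at
    ) AS row_no
    FROM foo
  ) AS a
  WHERE a.row_no > 1                        -- row 1 is the newest in each group and is kept
)
RETURNING *;                                -- lists the rows that were deleted

-- If the returned rows are not the ones you expected, run ROLLBACK instead.
COMMIT;
```

With the sample table from the question, this should delete ids 1, 3, 4, 5 and 6 and leave ids 2 and 7.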



2 Comments

Why are we using PARTITION BY region rather than PARTITION BY region, name? What is the difference?
Your example data has the same value for name in every row, so adding it to the PARTITION BY would not change the output for the example. If name can vary in your real data, use PARTITION BY name, region so that duplicates are matched on the (name, region) pair.
1
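DELETE FROM foo
WHERE id IN (
  SELECT a.id
  FROM (
    -- Number the rows in each (name, region) group, newest first
    SELECT id, ROW_NUMBER() OVER (
        PARTITION BY name, region
        ORDER BY created_at DESC
    ) AS row_no
    FROM foo
  ) AS a  -- the alias is required before PostgreSQL 16
  WHERE a.row_no > 1  -- everything except the newest row in each group
);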
