How can I use PostgreSQL's DISTINCT ON clause to also return a count of the duplicates?

Question

Suppose I have a table like this:

+--------+--------+------+--------+---------+
|   A    |   B    |  C   |   g    |    h    |
+--------+--------+------+--------+---------+
| cat    | dog    | bird | 34.223 |  54.223 |
| cat    | pigeon | goat |  23.23 |  54.948 |
| cat    | dog    | bird | 17.386 |  26.398 |
| gopher | pigeon | bird | 23.552 |  89.223 |
+--------+--------+------+--------+---------+

but with many more fields to the right (i, j, k, ...).

I need a resulting table that looks like:

+-----+--------+------+-----+-----+-----+-----+-------+
|  A  |   B    |  C   |  g  |  h  | ... |  z  | count |
+-----+--------+------+-----+-----+-----+-----+-------+
| cat | dog    | bird | xxx | xxx |     | xxx |    23 |
| cat | pigeon | goat | xxx | xxx |     | xxx |    78 |
+-----+--------+------+-----+-----+-----+-----+-------+

I would normally use a GROUP BY, but I don't want to have to repeat all of the column names (g, h, i, ... z).

I can currently get the result I want using a CTE and a correlated subquery combined with DISTINCT ON, but the query is very slow to run (500k+ records) and has a lot of duplication:

WITH temp AS (
    SELECT a, b, c, COUNT(*)
    FROM my_table
    GROUP BY a, b, c
)
SELECT DISTINCT ON (a, b, c) *, (
    SELECT count
    FROM temp
    WHERE 
        temp.a = t.a 
        AND temp.b = t.b 
        AND temp.c = t.c
) as count
FROM my_table as t
ORDER BY a, b, c, x, y;

Is there a more efficient way to get the count of the rows that were eliminated by DISTINCT ON? Something like

SELECT DISTINCT ON (a, b, c)
    *, COUNT(*)
FROM my_table
ORDER BY a, b, c, count;

Or am I taking the wrong approach to begin with?

404 · Accepted Answer · 2018-11-29 19:51:16Z

Use COUNT(*) as a window function with PARTITION BY:

SELECT DISTINCT ON (a, b, c) *, COUNT(*) OVER (PARTITION BY a, b, c)
FROM my_table

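You should probably also add an ORDER BY to your query if you care at all about the rest of the fields; otherwise the row chosen to supply those fields for each group may be inconsistent from run to run. In PostgreSQL the ORDER BY has to start with the DISTINCT ON expressions (a, b, c), followed by whatever columns decide which row is kept.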
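
To make this concrete, here is a minimal sketch that runs the query against the four sample rows from the question. The temporary table definition, the column types, and the choice of g as the tie-breaker in ORDER BY are assumptions for illustration only; the real table has more columns and may call for a different ordering.

```sql
-- Hypothetical demo table holding only the sample rows from the question
-- (columns a, b, c, g, h; the real table has many more columns).
CREATE TEMP TABLE my_table (
    a text,
    b text,
    c text,
    g numeric,
    h numeric
);

INSERT INTO my_table (a, b, c, g, h) VALUES
    ('cat',    'dog',    'bird', 34.223, 54.223),
    ('cat',    'pigeon', 'goat', 23.23,  54.948),
    ('cat',    'dog',    'bird', 17.386, 26.398),
    ('gopher', 'pigeon', 'bird', 23.552, 89.223);

-- The window count is computed over every row of each (a, b, c) group
-- before DISTINCT ON discards the duplicates, so it reports the full group size.
SELECT DISTINCT ON (a, b, c)
       *,
       COUNT(*) OVER (PARTITION BY a, b, c) AS count
FROM my_table
ORDER BY a, b, c, g;  -- assumed tie-breaker: keep the row with the smallest g
```

With this data the query returns three rows: (cat, dog, bird) with g = 17.386 and a count of 2, then (cat, pigeon, goat) and (gopher, pigeon, bird), each with a count of 1. Note that the count is the total number of rows in the group, including the one that is kept; subtract 1 if you only want the number of rows that were eliminated.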

edited Nov 29, 2018 at 19:51

answered Nov 29, 2018 at 19:05

404
