Get count of duplicate rows in DISTINCT ON

Question

In any SELECT query, where a DISTINCT ON is used, how can one additionally get the number of duplicates for each row in the result set?

Take e.g.

SELECT
  DISTINCT ON (building)
  building,
  name
FROM ...
WHERE ...

This will only return the first result for each building. I want to add another column, so the results look like this:

name | building | excluded
Fred | Office   | 0
Bob  | Storage  | 3

when there are more people than Bob in Storage. I'm using Postgres 10.

score 7 · Accepted Answer · 2018-06-26 09:15:30Z


You can use a window function:

with data (name, building) as (
  values 
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Fred', 'Office'),
    ('Tim', 'Home'),
    ('Tim', 'Home')
)
select distinct on (building) *, 
       count(*) over (partition by building) - 1 as excluded
from data
order by building;

returns:

name | building | excluded
-----+----------+---------
Tim  | Home     |        1
Fred | Office   |        0
Bob  | Storage  |        3

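This works because window functions are evaluated before DISTINCT ON is applied, so count(*) still sees every row of the partition, including the ones that DISTINCT ON later discards.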

However, this means doing some work twice. I think it might be faster to reuse the partitioning "work" to filter out the duplicates as well:

with ranked as (
  select *, 
         count(*) over w - 1 as excluded, 
         row_number() over w as rn
  from your_table
  window w as (partition by building)
) 
select *
from ranked
where rn = 1;
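
One caveat with both queries above: without an ORDER BY, which row survives per building is arbitrary. Below is a minimal sketch of a deterministic variant, assuming you want the alphabetically first name per building (that ordering and the sample names are my assumptions, not part of the question). The count stays on an unordered window on purpose: adding ORDER BY to the window used by count(*) would turn it into a running count under the default frame.

```sql
-- Hypothetical sample data with distinct names per building.
with data (name, building) as (
  values
    ('Carl',  'Storage'),
    ('Bob',   'Storage'),
    ('Alice', 'Storage'),
    ('Fred',  'Office')
),
ranked as (
  select name,
         building,
         -- unordered window: counts the whole partition
         count(*) over (partition by building) - 1 as excluded,
         -- ordered window: decides which row is kept
         row_number() over (partition by building order by name) as rn
  from data
)
select name, building, excluded
from ranked
where rn = 1
order by building;
```

For this sample the result should be Fred / Office / 0 and Alice / Storage / 2.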

edited Jun 26, 2018 at 9:15

answered Jun 26, 2018 at 9:08

user330315


1 Comment

turbo Over a year ago

Your last approach even answers a follow up question I was about to ask wrt conditional deduplication. Neat.

klin · score 1 · Answer · 2018-06-26 09:14:05Z

You can simply use group by instead of distinct on (to avoid window functions):

with data (name, building) as (
  values 
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Fred', 'Office'),
    ('Tim', 'Home'),
    ('Tim', 'Home')
)   
select min(name), building, count(*)- 1 as excluded
from data
group by building
order by building;

 min  | building | excluded 
------+----------+----------
 Tim  | Home     |        1
 Fred | Office   |        0
 Bob  | Storage  |        3
(3 rows)
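
If more columns are needed, each one has to go through an aggregate, and independent aggregates such as min() can combine values from different rows. Here is a sketch of one way around that, assuming a hypothetical extra column age and an ordering key (name) that is unique within each building: aggregate every column into an array with the same ordering and take the first element.

```sql
-- Hypothetical sample data; the age column is not in the original question.
with data (name, age, building) as (
  values
    ('Bob',   40, 'Storage'),
    ('Alice', 55, 'Storage'),
    ('Fred',  30, 'Office')
)
select (array_agg(name order by name))[1] as name,
       -- same ordering as above, so name and age come from the same row
       (array_agg(age  order by name))[1] as age,
       building,
       count(*) - 1 as excluded
from data
group by building
order by building;
```

For Storage this should return Alice with age 55, whereas min(name), min(age) would pair Alice with 40, which is Bob's age.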

answered Jun 26, 2018 at 9:14

klin


3 Comments

turbo Over a year ago

Is there any significant difference in e.g. performance on large tables using GROUP instead of DISTINCT?

user330315 Over a year ago

The group by approach only works as long as no additional columns need to be selected, but in that case I think it's indeed more efficient.

klin Over a year ago

Yes, the difference should be significant in favor of group by. You have to use aggregates for other columns (like min(name) in the example).

gpeche · score -1 · Answer · 2018-06-26 09:10:39Z

Use window functions?

select distinct  -- collapse the identical per-partition rows into one
  first_value(name) over w as first_name,
  building,
  -- no ORDER BY here: with one, the default frame would yield a running count
  count(*) over (partition by building) - 1 as excluded
from my_source_table
window w as (partition by building order by name);  -- name is a placeholder; use your own order

answered Jun 26, 2018 at 9:10

gpeche

