1

In a SELECT query that uses DISTINCT ON, how can one additionally get the number of duplicates for each row in the result set?

Take, for example:

SELECT
  DISTINCT ON (building)
  building,
  name
FROM ...
WHERE ...

This will only return the first result for each building. I want to add another column, so the results look like this:

name | building | excluded
Fred | Office   | 0
Bob  | Storage  | 3

when Storage holds three more people besides Bob. I'm using Postgres 10.

3 Answers

7

You can use a window function:

with data (name, building) as (
  values 
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Fred', 'Office'),
    ('Tim', 'Home'),
    ('Tim', 'Home')
)
select distinct on (building) *, 
       count(*) over (partition by building) - 1 as excluded
from data
order by building;

returns:

name | building | excluded
-----+----------+---------
Tim  | Home     |        1
Fred | Office   |        0
Bob  | Storage  |        3

This works because window functions are evaluated before DISTINCT ON is applied, so the count still sees every row of the partition.
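When the rows in a building are not identical, which row survives depends on the ORDER BY. The following is a minimal sketch with made-up sample names (Alice, Carol, Dave are not from the question): ordering by building and then name keeps the alphabetically first person, while the window count still covers the whole partition.

```sql
-- Sample data is hypothetical; only the shape (name, building) comes from the question.
with data (name, building) as (
  values
    ('Bob',   'Storage'),
    ('Alice', 'Storage'),
    ('Carol', 'Storage'),
    ('Dave',  'Storage'),
    ('Fred',  'Office')
)
select distinct on (building)
       name,
       building,
       -- evaluated over all rows of the building, before DISTINCT ON drops any
       count(*) over (partition by building) - 1 as excluded
from data
-- the leading ORDER BY expression must match DISTINCT ON;
-- the second one decides which row per building is kept
order by building, name;
```

With this data the query should return Fred / Office / 0 and Alice / Storage / 3.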

However, combining DISTINCT ON with a window function means doing some of the work twice. I think it might be faster to re-use the partitioning "work" to also filter out the duplicates:

with ranked as (
  select *, 
         count(*) over w - 1 as excluded, 
         row_number() over w as rn
  from your_table
  window w as (partition by building)
) 
select *
from ranked
where rn = 1;
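One caveat if a specific row per building should be kept: adding an ORDER BY to the shared window w would also change the count, because a window with ORDER BY defaults to a frame that ends at the current row, which turns count(*) into a running count. A sketch that avoids this by using two separate window definitions (the sample data and the choice to order by name are assumptions for illustration):

```sql
-- Sample data is hypothetical; ordering by name is just one possible tie-breaker.
with data (name, building) as (
  values
    ('Bob',   'Storage'),
    ('Alice', 'Storage'),
    ('Carol', 'Storage'),
    ('Dave',  'Storage'),
    ('Fred',  'Office')
),
ranked as (
  select name,
         building,
         -- unordered window: the frame is the whole partition, so this is the full count
         count(*) over (partition by building) - 1 as excluded,
         -- ordered window: only used to number the rows deterministically
         row_number() over (partition by building order by name) as rn
  from data
)
select name, building, excluded
from ranked
where rn = 1
order by building;
```

With this data the result should be Fred / Office / 0 and Alice / Storage / 3. The filter on rn can be swapped for any other condition, which is what makes this form convenient for conditional deduplication.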

1 Comment

Your last approach even answers a follow up question I was about to ask wrt conditional deduplication. Neat.
1

You can simply use group by instead of distinct on (to avoid window functions):

with data (name, building) as (
  values 
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Bob', 'Storage'),
    ('Fred', 'Office'),
    ('Tim', 'Home'),
    ('Tim', 'Home')
)   
select min(name), building, count(*)- 1 as excluded
from data
group by building
order by building;

 min  | building | excluded 
------+----------+----------
 Tim  | Home     |        1
 Fred | Office   |        0
 Bob  | Storage  |        3
(3 rows)
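If more columns are needed, each of them has to go through an aggregate as well, and the aggregated values are computed independently of each other, so they need not come from the same source row. A small sketch, assuming a hypothetical extra column room that is not part of the question:

```sql
-- The room column and all values are made up for illustration.
with data (name, building, room) as (
  values
    ('Bob',   'Storage', 7),
    ('Alice', 'Storage', 12),
    ('Fred',  'Office',  3)
)
select min(name) as name,   -- 'Alice' for Storage
       building,
       min(room) as room,   -- 7 for Storage, which is Bob's room, not Alice's
       count(*) - 1 as excluded
from data
group by building
order by building;
```

For Storage this yields Alice with room 7, a combination that does not exist in the data. When the extra columns must come from one and the same row, DISTINCT ON or the row_number() approach is the safer choice.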

3 Comments

Is there any significant difference in e.g. performance on large tables using GROUP instead of DISTINCT?
The group by approach only works as long as no additional columns need to be selected, but in that case I think it's indeed more efficient.
Yes, the difference should be significant in favor of group by. You have to use aggregates for other columns (like min(name) in the example).
-1

Use window functions?

select distinct
  first_value(name) over (partition by building order by /* your order */) as first_name,
  building,
  -- no ORDER BY here: with one, the default frame makes this a running count
  count(*) over (partition by building) - 1 as excluded
from (
    select name, building
    from my_source_table
) as src;  -- Postgres 10 requires an alias for a subquery in FROM
