Efficient way to find count of duplicate records

Question

I have only basic knowledge of SQL, and this is my first question under the SQL tag, so pardon me if I miss something at first. There is a large database in which one table of particular interest has more than 100 million rows. The client uses PostgreSQL.

The structure of this table is something like this (sample data); say the table is named table:

  biometric_token ID status State_code
1            ABCD  1      A          1
2            ABCD  2      R          2
3            ABCD  3      A          2
4            BCDE  4      A          1
5            DEFG  5      A          2
6            DEFG  6      A          2
7            DEFG  7      A          3
8            FGHI  8      A          4

The column biometric_token is a biometric-based ID that I can safely assume is unique per person and cannot be faked. This token is collected at the end of the registration process, to which the other three columns pertain.
ID is a system-generated unique ID given to every individual when he/she approaches for registration.
status is either approved (A) or rejected (R).
State_code is the code of the state/county in which the individual is registered.

What I need is to find how many individuals (count only; an individual is identified by biometric_token) have been registered more than once. The output format should be something like

  count_of_individuals number_of_times
1                    1               3
2                    1               2

Explanation

ABCD successfully registered 2 times
DEFG successfully registered 3 times

The code (a nested query) that I have tried is given below. It works but is very slow. I need to find out whether there is a faster alternative that can retrieve this summary from this huge database.

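SELECT COUNT(DISTINCT biometric_token) AS count_of_individuals, cnt AS number_of_times
FROM
-- inner query: number of approved registrations per individual
(SELECT biometric_token, COUNT(ID) AS cnt
FROM "table"  -- placeholder name, quoted because TABLE is a reserved word
WHERE status IN ('A')
GROUP BY biometric_token
-- HAVING cannot reference the alias cnt, so the aggregate is repeated
HAVING COUNT(ID) > 1) AS dup  -- alias for the derived table, required by older PostgreSQL versions

GROUP BY number_of_times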
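
For anyone who wants to try the query on a small scale, here is a minimal sketch that loads the sample rows shown above. The table name yourTable and the column types are assumptions, since the real schema is not given in the question.

```sql
-- Hypothetical reproduction of the sample data; yourTable and the
-- column types are assumptions, not the real schema.
CREATE TABLE yourTable (
    biometric_token text    NOT NULL,
    id              integer PRIMARY KEY,
    status          char(1) NOT NULL,
    state_code      integer NOT NULL
);

INSERT INTO yourTable (biometric_token, id, status, state_code) VALUES
    ('ABCD', 1, 'A', 1),
    ('ABCD', 2, 'R', 2),
    ('ABCD', 3, 'A', 2),
    ('BCDE', 4, 'A', 1),
    ('DEFG', 5, 'A', 2),
    ('DEFG', 6, 'A', 2),
    ('DEFG', 7, 'A', 3),
    ('FGHI', 8, 'A', 4);

-- Count individuals by how many approved registrations they have,
-- keeping only those registered more than once.
SELECT COUNT(DISTINCT biometric_token) AS count_of_individuals,
       cnt AS number_of_times
FROM (
    SELECT biometric_token, COUNT(id) AS cnt
    FROM yourTable
    WHERE status IN ('A')
    GROUP BY biometric_token
    HAVING COUNT(id) > 1
) AS dup
GROUP BY cnt
ORDER BY number_of_times DESC;

-- Result on the sample rows:
--  count_of_individuals | number_of_times
--                     1 |               3
--                     1 |               2
```

ABCD has two approved rows (its rejected row is filtered out) and DEFG has three, which matches the expected output in the question.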

Tim Biegeleisen · Accepted Answer · 2021-06-16 04:22:13Z

I would probably also have suggested your current query. But we can try adding the following index to your table:

CREATE INDEX idx ON yourTable (status, biometric_token);

This index should cover the entire query, and would at least allow Postgres to aggregate over only the subset of the table consisting of the approved registration records.
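
As a rough sketch of how the index and the query might fit together: Postgres can answer a query from an index alone only when every column the query references is in that index, and COUNT(ID) references ID, which is not part of idx. The variant below uses COUNT(*) instead, which gives the same numbers under the assumption that ID is never NULL (the question describes it as a system-generated unique ID). The table name yourTable is a placeholder.

```sql
-- Index from the answer; yourTable is a placeholder table name.
CREATE INDEX idx ON yourTable (status, biometric_token);

-- Refresh planner statistics so the new index is costed sensibly.
ANALYZE yourTable;

-- Variant of the summary query that references only the indexed columns.
-- Assumes id is never NULL, so COUNT(*) equals COUNT(id).
-- The subquery returns one row per biometric_token, so the outer
-- COUNT(*) equals COUNT(DISTINCT biometric_token).
SELECT COUNT(*) AS count_of_individuals,
       cnt      AS number_of_times
FROM (
    SELECT biometric_token, COUNT(*) AS cnt
    FROM yourTable
    WHERE status = 'A'
    GROUP BY biometric_token
    HAVING COUNT(*) > 1
) AS dup
GROUP BY cnt;
```

Whether this is actually faster on a table of 100 million rows depends on the data distribution and on how recently the table was vacuumed, so it is worth measuring rather than assuming.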


3 Comments

AnilGoyal Over a year ago

Can you please explain a bit further what exactly it will do and, after creating this index, how I'll have to proceed or change my select query?

Tim Biegeleisen Over a year ago

@AnilGoyal A full explanation of how a B-tree index works in SQL is out of scope for a single question/answer. The first column in the index, status, immediately lets Postgres discard any records other than those which have been approved. The second column, biometric_token, is there to allow Postgres to do the aggregation using the index alone.

Nick.Mc Over a year ago

"how I'll have to proceed or change my select query": you don't have to change your select query. 99% of the time the database uses the index automatically.
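
One way to check whether the index is being picked up is to look at the query plan. The sketch below assumes the index idx and the placeholder table yourTable from the answer. Note that EXPLAIN with the ANALYZE option really executes the statement, so on a very large table plain EXPLAIN (estimated plan only) may be the safer first step.

```sql
-- Estimated plan only; does not run the query.
EXPLAIN
SELECT biometric_token, COUNT(*) AS cnt
FROM yourTable
WHERE status = 'A'
GROUP BY biometric_token
HAVING COUNT(*) > 1;

-- Actual plan with timings and buffer usage; this executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT biometric_token, COUNT(*) AS cnt
FROM yourTable
WHERE status = 'A'
GROUP BY biometric_token
HAVING COUNT(*) > 1;
```

In the output, a node such as "Index Only Scan using idx" or "Index Scan using idx" indicates the index is in use, while "Seq Scan on yourtable" means the planner chose to read the whole table. If an index-only scan reports many heap fetches, running VACUUM on the table may help, because index-only scans rely on the visibility map that VACUUM maintains.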
