Create some test data...
create unlogged table users( user_id serial primary key, login text unique not null );
insert into users (login) select 'user'||n from generate_series(1,100000) n;
create unlogged table messages( message_id serial primary key, sender_id integer not null, receiver_id integer not null);
insert into messages (sender_id,receiver_id) select random()*100000+1, random()*100000+1 from generate_series(1,1000000);
create index messages_s on messages(sender_id);
create index messages_r on messages(receiver_id);
vacuum analyze users,messages;
And then:
EXPLAIN ANALYZE
SELECT user_id, count(DISTINCT m1.message_id), count(DISTINCT m2.message_id)
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = user_id
LEFT JOIN messages m2 ON m2.sender_id = user_id
GROUP BY user_id;
GroupAggregate (cost=4.39..326190.22 rows=100000 width=20) (actual time=4.023..3331.031 rows=100000 loops=1)
  Group Key: u.user_id
  ->  Merge Left Join (cost=4.39..250190.22 rows=10000000 width=12) (actual time=3.987..2161.032 rows=9998915 loops=1)
        Merge Cond: (u.user_id = m1.receiver_id)
        ->  Merge Left Join (cost=2.11..56522.26 rows=1000000 width=8) (actual time=3.978..515.730 rows=1000004 loops=1)
              Merge Cond: (u.user_id = m2.sender_id)
              ->  Index Only Scan using users_pkey on users u (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.016..10.149 rows=100000 loops=1)
                    Heap Fetches: 0
              ->  Index Scan using messages_s on messages m2 (cost=0.42..41168.40 rows=1000000 width=8) (actual time=0.011..397.128 rows=999996 loops=1)
        ->  Materialize (cost=0.42..43668.42 rows=1000000 width=8) (actual time=0.008..746.748 rows=9998810 loops=1)
              ->  Index Scan using messages_r on messages m1 (cost=0.42..41168.42 rows=1000000 width=8) (actual time=0.006..392.426 rows=999997 loops=1)
Execution Time: 3432.131 ms
Since I put in 100k users and 1M messages, each user has about 10 messages as sender and about 10 as receiver, so the two joins generate roughly 10*10 = 100 rows per user, or about 10 million rows in total (rows=9998915 in the plan above), which all have to be processed by the count(DISTINCT ...) aggregates. Postgres doesn't realize this is all unnecessary work: the counts and GROUP BY should really be pushed down into the joined tables, so the query ends up being extremely slow.
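To double-check that on this test data, something like the following should report an average of roughly 10 in both columns (the exact figures vary since the rows were generated with random()):

SELECT (SELECT avg(cnt) FROM (SELECT count(*) cnt FROM messages GROUP BY sender_id) s)   AS avg_sent,
       (SELECT avg(cnt) FROM (SELECT count(*) cnt FROM messages GROUP BY receiver_id) r) AS avg_received;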
The solution is to move the aggregation inside the joined tables manually, to avoid generating all these unnecessary extra rows.
EXPLAIN ANALYZE
SELECT user_id, m1.cnt, m2.cnt
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;
Hash Left Join (cost=46780.40..48846.42 rows=100000 width=20) (actual time=469.699..511.613 rows=100000 loops=1)
  Hash Cond: (u.user_id = m2.sender_id)
  ->  Hash Left Join (cost=23391.68..25195.19 rows=100000 width=12) (actual time=237.435..262.545 rows=100000 loops=1)
        Hash Cond: (u.user_id = m1.receiver_id)
        ->  Seq Scan on users u (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..5.162 rows=100000 loops=1)
        ->  Hash (cost=22243.34..22243.34 rows=91867 width=12) (actual time=237.252..237.253 rows=99991 loops=1)
              Buckets: 131072 Batches: 1 Memory Usage: 5321kB
              ->  Subquery Scan on m1 (cost=20406.00..22243.34 rows=91867 width=12) (actual time=210.817..227.793 rows=99991 loops=1)
                    ->  HashAggregate (cost=20406.00..21324.67 rows=91867 width=12) (actual time=210.815..222.794 rows=99991 loops=1)
                          Group Key: messages.receiver_id
                          Batches: 1 Memory Usage: 14353kB
                          ->  Seq Scan on messages (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..47.173 rows=1000000 loops=1)
  ->  Hash (cost=22241.52..22241.52 rows=91776 width=12) (actual time=232.003..232.004 rows=99992 loops=1)
        Buckets: 131072 Batches: 1 Memory Usage: 5321kB
        ->  Subquery Scan on m2 (cost=20406.00..22241.52 rows=91776 width=12) (actual time=205.401..222.517 rows=99992 loops=1)
              ->  HashAggregate (cost=20406.00..21323.76 rows=91776 width=12) (actual time=205.400..217.518 rows=99992 loops=1)
                    Group Key: messages_1.sender_id
                    Batches: 1 Memory Usage: 14353kB
                    ->  Seq Scan on messages messages_1 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..43.402 rows=1000000 loops=1)
Planning Time: 0.574 ms
Execution Time: 515.753 ms
I used a schema that is a bit different from yours, but you get the idea: instead of generating lots of duplicate rows by doing what is essentially a cross product, push the aggregation into the joined subqueries so each one returns only one row per value of the column you're joining on, then drop the GROUP BY from the main query since it is no longer necessary.
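One difference from the count(DISTINCT ...) version worth noting: with the aggregated subqueries, users who have no messages at all get NULL from the LEFT JOIN rather than 0. If you need the zeros, wrap the counts in COALESCE; a minimal sketch against the same test schema (the received/sent aliases are just illustrative):

SELECT user_id,
       coalesce(m1.cnt, 0) AS received,
       coalesce(m2.cnt, 0) AS sent
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;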
Note that count(DISTINCT table.*) is not smart enough to realize it only needs to look at the table's primary key (if there is one), so it pulls the whole row and runs the DISTINCT on that. When a table is named "message" or "question_response" it smells like it has a largish TEXT column in it, which will make this very slow. So if you really do need a count(DISTINCT ...), use count(DISTINCT table.primarykey) instead:
explain analyze SELECT count(distinct user_id) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=15.220..15.221 rows=1 loops=1)
  ->  Seq Scan on users (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.016..5.830 rows=100000 loops=1)
Execution Time: 15.263 ms
explain analyze SELECT count(distinct users.*) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=90.896..90.896 rows=1 loops=1)
  ->  Seq Scan on users (cost=0.00..1541.00 rows=100000 width=37) (actual time=0.038..38.497 rows=100000 loops=1)
Execution Time: 90.958 ms