Postgres Large table Optimal index selection for Select Query With In Clause

Question

We have a very large table with a total number of rows around ~4 billion and everyday throughput is around ~20-25 million.

Below is the Sample Table definition.

table_name (
   id bigint,
   user_id bigint,
   status  smallint,
   date    timestamp,
   .
   .
   some more columns
);

NOTE: status is an enum and can have 6-7 possible values.

Below is the query that we want to optimise:

SELECT * 
FROM table_name 
WHERE user_id = $user_id 
AND status in (1, 2, 3, 6, 7, 10) 
ORDER BY date 
DESC LIMIT 20;

Initially, we were having index1 on table_name (user_id, status).

Since this index was not giving optimal performance. We thought of including the date in the index as well.

Now, we have tried creating a bunch of different indexes on the table, but the explain plan always picks up the initial index i.e: index1.

Below are the indexes that we tried:

index2: table_name (user_id, status, date)
index3: table_name (user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))
index4: table_name (user_id, status) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]));

Below is the explain analyse output:

Limit  (cost=27029.07..27029.12 rows=20 width=251) (actual time=32.466..32.470 rows=20 loops=1)
 ->  Sort  (cost=27029.07..27049.17 rows=8041 width=251) (actual time=32.465..32.467 rows=20 loops=1)
      Sort Key: id DESC
      Sort Method: top-N heapsort  Memory: 38kB
    ->  Index Scan using index1 on table_name wt  (cost=0.58..26815.10 rows=8041 width=251) (actual time=0.027..26.991 rows=37362 loops=1)
         Index Cond: ((user_id = 111111) AND (status = ANY ('{1,3,2,7,6,10,8}'::integer[])))

Planning Time: 0.320 ms
Execution Time: 32.498 ms

Our database postgres version is: 12.7 and we run vaccuming regularly.

We want to understand why other indexes are not being used for our query.

Also given our use case, Can there be a better way of creating index so that we can serve the query in acceptable response time?

1) A less specific index is redundant and therefore I would delete it (index2 should be sufficient) 2) postgresql.org/docs/12/planner-stats.html — David Lukas
– David Lukas, Commented Mar 10, 2022 at 13:37
32 milliseconds seems quite OK - how fast do you need that to be? — user330315
– user330315, Commented Mar 10, 2022 at 13:39
@a_horse_with_no_name :- when we run the query for the first time for a user it takes almost 25-30 seconds, the explain analyse plan I have shared is for subsequent query. I am guessing after first page data page is loaded in memory and we are getting better results. — Mohd Waseem
– Mohd Waseem, Commented Mar 10, 2022 at 15:41
If you use explain (analyze, buffers) this would be visible in the plan — user330315
– user330315, Commented Mar 10, 2022 at 15:44
You need to capture the plan that is actually the problem. You should be able to just pick a new user and show us the explain (analyze, buffers) for that one. — jjanes
– jjanes, Commented Mar 10, 2022 at 17:35

jian · Accepted Answer · 2022-03-10 16:14:49Z

3

CREATE INDEX table_name_desc_index ON table_name (userid,_date DESC NULLS LAST);

Then try following:

SELECT *
FROM table_name
inner join (values (1),(2),(3),(6),(7),(10)) val(v) ON (table_name.status = v )
WHERE user_id = $user_id
ORDER BY date
DESC LIMIT 20;

answered Mar 10, 2022 at 16:14

jian

5,0231 gold badge29 silver badges47 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Mohd Waseem Over a year ago

Thanks @Mark: this index is giving us better performance: table_name (user_id asc, date desc) after slightly updating the query and adding nulls last to order by clause.

Umut TEKİN Over a year ago

@MohdWaseem, could you share new explain plan output?

jjanes · Accepted Answer · 2022-03-10 17:55:51Z

You are selecting *, so you will not be able to get an index-only scan due to the * dragging in more columns than are in the index. The only advantage of those other indexes you showed (that I can see) would be to enable index-only scans, so if that cannot work, it is not surprising those indexes are not selected for use. You could test this theory by changing the * to just the columns appearing in the index to see what happens.

About one of your indexes:

(user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

This index seems rather pointless. The benefit of the WHERE clause is that it reduces the disparate qualifying values of "status" down to a single value ('true'). But then putting "status" into the index body just breaks them back up again. The better index would be:

(user_id, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

This one can jump to the end of the specific user_id, scan backwards (to fulfill the order by date desc) and stop once it finds 20 rows. When you have "status" as an interloper, it prevents it from doing that.

Collectives™ on Stack Overflow

Postgres Large table Optimal index selection for Select Query With In Clause

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related