4

We have a very large table with a total number of rows around ~4 billion and everyday throughput is around ~20-25 million.

Below is the Sample Table definition.

table_name (
   id bigint,
   user_id bigint,
   status  smallint,
   date    timestamp,
   .
   .
   some more columns
);

NOTE: status is an enum and can have 6-7 possible values.

Below is the query that we want to optimise:

SELECT * 
FROM table_name 
WHERE user_id = $user_id 
AND status in (1, 2, 3, 6, 7, 10) 
ORDER BY date 
DESC LIMIT 20;

Initially, we were having index1 on table_name (user_id, status).

Since this index was not giving optimal performance. We thought of including the date in the index as well.

Now, we have tried creating a bunch of different indexes on the table, but the explain plan always picks up the initial index i.e: index1.

Below are the indexes that we tried:

index2: table_name (user_id, status, date)
index3: table_name (user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))
index4: table_name (user_id, status) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]));

Below is the explain analyse output:

Limit  (cost=27029.07..27029.12 rows=20 width=251) (actual time=32.466..32.470 rows=20 loops=1)
 ->  Sort  (cost=27029.07..27049.17 rows=8041 width=251) (actual time=32.465..32.467 rows=20 loops=1)
      Sort Key: id DESC
      Sort Method: top-N heapsort  Memory: 38kB
    ->  Index Scan using index1 on table_name wt  (cost=0.58..26815.10 rows=8041 width=251) (actual time=0.027..26.991 rows=37362 loops=1)
         Index Cond: ((user_id = 111111) AND (status = ANY ('{1,3,2,7,6,10,8}'::integer[])))

Planning Time: 0.320 ms
Execution Time: 32.498 ms

Our database postgres version is: 12.7 and we run vaccuming regularly.

We want to understand why other indexes are not being used for our query.

Also given our use case, Can there be a better way of creating index so that we can serve the query in acceptable response time?

7
  • 1) A less specific index is redundant and therefore I would delete it (index2 should be sufficient) 2) postgresql.org/docs/12/planner-stats.html Commented Mar 10, 2022 at 13:37
  • 1
    32 milliseconds seems quite OK - how fast do you need that to be? Commented Mar 10, 2022 at 13:39
  • @a_horse_with_no_name :- when we run the query for the first time for a user it takes almost 25-30 seconds, the explain analyse plan I have shared is for subsequent query. I am guessing after first page data page is loaded in memory and we are getting better results. Commented Mar 10, 2022 at 15:41
  • If you use explain (analyze, buffers) this would be visible in the plan Commented Mar 10, 2022 at 15:44
  • You need to capture the plan that is actually the problem. You should be able to just pick a new user and show us the explain (analyze, buffers) for that one. Commented Mar 10, 2022 at 17:35

2 Answers 2

3
CREATE INDEX table_name_desc_index ON table_name (userid,_date DESC NULLS LAST);

Then try following:

SELECT *
FROM table_name
inner join (values (1),(2),(3),(6),(7),(10)) val(v) ON (table_name.status = v )
WHERE user_id = $user_id
ORDER BY date
DESC LIMIT 20;
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks @Mark: this index is giving us better performance: table_name (user_id asc, date desc) after slightly updating the query and adding nulls last to order by clause.
@MohdWaseem, could you share new explain plan output?
3

You are selecting *, so you will not be able to get an index-only scan due to the * dragging in more columns than are in the index. The only advantage of those other indexes you showed (that I can see) would be to enable index-only scans, so if that cannot work, it is not surprising those indexes are not selected for use. You could test this theory by changing the * to just the columns appearing in the index to see what happens.

About one of your indexes:

(user_id, status, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

This index seems rather pointless. The benefit of the WHERE clause is that it reduces the disparate qualifying values of "status" down to a single value ('true'). But then putting "status" into the index body just breaks them back up again. The better index would be:

(user_id, date) where (status = ANY (ARRAY [1, 2, 3, 6, 7, 10]))

This one can jump to the end of the specific user_id, scan backwards (to fulfill the order by date desc) and stop once it finds 20 rows. When you have "status" as an interloper, it prevents it from doing that.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.