
I had a DB in MySQL and am in the process of moving to PostgreSQL with a Django front-end.

I have a table of 650k-750k rows on which I perform the following query:

SELECT "MMG", "Gene", COUNT(*) FROM at_summary_typing WHERE "MMG" != '' GROUP BY "MMG", "Gene" ORDER BY COUNT(*);

In MySQL this returns in ~0.5s. However, since switching to PostgreSQL the same query takes ~3s. I have put a combined index on MMG and Gene to try to speed it up, but the output of EXPLAIN (analyse, buffers, verbose) shows the index is not used:

 Sort  (cost=59013.54..59053.36 rows=15927 width=14) (actual time=2880.222..2885.475 rows=39314 loops=1)
   Output: "MMG", "Gene", (count(*))
   Sort Key: (count(*))
   Sort Method: external merge  Disk: 3280kB
   Buffers: shared hit=16093 read=11482, temp read=2230 written=2230
   ->  GroupAggregate  (cost=55915.50..57901.90 rows=15927 width=14) (actual time=2179.809..2861.679 rows=39314 loops=1)
         Output: "MMG", "Gene", count(*)
         Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
         ->  Sort  (cost=55915.50..56372.29 rows=182713 width=14) (actual time=2179.782..2830.232 rows=180657 loops=1)
               Output: "MMG", "Gene"
               Sort Key: at_summary_typing."MMG", at_summary_typing."Gene"
               Sort Method: external merge  Disk: 8168kB
               Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
               ->  Seq Scan on public.at_summary_typing  (cost=0.00..36821.60 rows=182713 width=14) (actual time=0.010..224.658 rows=180657 loops=1)
                     Output: "MMG", "Gene"
                     Filter: ((at_summary_typing."MMG")::text <> ''::text)
                     Rows Removed by Filter: 559071
                     Buffers: shared hit=16093 read=11482
 Total runtime: 2888.804 ms

After some searching I found that I could force the use of the index by running SET enable_seqscan = OFF;, and EXPLAIN now shows the following:

Sort  (cost=1181591.18..1181631.00 rows=15927 width=14) (actual time=555.546..560.839 rows=39314 loops=1)
   Output: "MMG", "Gene", (count(*))
   Sort Key: (count(*))
   Sort Method: external merge  Disk: 3280kB
   Buffers: shared hit=173219 read=87094 written=7, temp read=411 written=411
   ->  GroupAggregate  (cost=0.42..1180479.54 rows=15927 width=14) (actual time=247.546..533.202 rows=39314 loops=1)
         Output: "MMG", "Gene", count(*)
         Buffers: shared hit=173219 read=87094 written=7
         ->  Index Only Scan using mm_gene_idx on public.at_summary_typing  (cost=0.42..1178949.93 rows=182713 width=14) (actual time=247.533..497.771 rows=180657 loops=1)
               Output: "MMG", "Gene"
               Filter: ((at_summary_typing."MMG")::text <> ''::text)
               Rows Removed by Filter: 559071
               Heap Fetches: 739728
               Buffers: shared hit=173219 read=87094 written=7
 Total runtime: 562.735 ms

Performance is now comparable with MySQL. The problem is that I understand setting enable_seqscan = OFF is bad practice, and that I should instead find a way to improve my query or encourage the planner to use the index automatically. However, I'm very inexperienced with PostgreSQL and cannot work out how or why it is choosing a Seq Scan over an Index Scan in the first place.

  • I don't have much experience with Postgres, but I'd add a new index on MMG alone, since your WHERE condition only filters on that field. Commented Jan 5, 2018 at 10:55
  • That is true, but it's the COUNT(*) which is actually slowing the query right down, and that is negated by turning the seq scan off. Adding an index on MMG alone has negligible effect, unfortunately. Commented Jan 5, 2018 at 11:01
  • "MMG" is a low-cardinality column? With a lot of NULLs and '' empty values? Commented Jan 5, 2018 at 11:03
  • It's a Varchar column where ~550k values are blank Commented Jan 5, 2018 at 11:05
  • What is your exact Postgres version (select version() will tell you) Commented Jan 5, 2018 at 11:24

1 Answer


why it is choosing to use a Seq Scan over an Index Scan in the first place

Because the seq scan is actually twice as fast as the index scan (224ms vs. 497ms), even though the index was nearly completely in the cache while the table was not.

So choosing the seq scan was the right thing to do.

The bottleneck in the first plan is the sorting and grouping that needs to be done on disk.

The better strategy is to increase work_mem to something more realistic than the really small default of 4MB. You might want to start with something like 16MB, by running

set work_mem = '16MB';

before running your query. If that doesn't remove the "Sort Method: external merge Disk" steps, increase work_mem further.

By increasing work_mem it also becomes possible that Postgres switches to a hash aggregate instead of the sort it currently does, which will probably be faster anyway (though that isn't feasible if not enough memory is available).
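As a quick check (a sketch assuming the table and query from the question), you can set work_mem for just the current session and re-run EXPLAIN to see whether the on-disk sort goes away:

```sql
-- Applies only to the current session.
SET work_mem = '16MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT "MMG", "Gene", COUNT(*)
FROM at_summary_typing
WHERE "MMG" != ''
GROUP BY "MMG", "Gene"
ORDER BY COUNT(*);

-- Look for "Sort Method: quicksort" (or a HashAggregate node)
-- in place of "Sort Method: external merge  Disk: ...".
```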

Once you find a good value, you might want to make it permanent by putting the new value into postgresql.conf.

Don't set this too high: that memory may be requested multiple times for each query.
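If you'd rather not edit postgresql.conf by hand, ALTER SYSTEM (available from PostgreSQL 9.4 on; on older versions such as 9.3 you do have to edit the file) writes the setting to postgresql.auto.conf:

```sql
-- Persist the setting, then reload the configuration
-- so it takes effect for new sessions.
ALTER SYSTEM SET work_mem = '16MB';
SELECT pg_reload_conf();
```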


If your where condition is static, you could also create a partial index matching that criteria:

create index on at_summary_typing ("MMG", "Gene") 
where "MMG" <> '';

Don't forget to analyze the table to update the statistics.
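For example:

```sql
-- Refresh the planner statistics for this table so the
-- new partial index can be costed accurately.
ANALYZE at_summary_typing;
```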


3 Comments

Thanks, this seems to have done the trick. The default was set to '1MB', and changing it to '2MB' has reduced the query time to ~0.3s! I'm not really sure what the safe limit for this setting is, though?
@PerlPingu: ah right, 9.3 used 1MB as the default (the current default is 4MB). 8MB or 16MB should be OK if you have a realistic amount of main memory (e.g. 16GB or more), unless you have hundreds of concurrent queries and each one uses many steps that require that much work_mem.
The VM has 8GB, so I'll leave it at 2MB for now, or maybe 4. I don't expect the DB to be hit particularly often with these queries; the Django front end is only used by a handful of people right now.
