I had a DB in MySQL and am in the process of moving to PostgreSQL with a Django front-end.
I have a table of 650k-750k rows on which I perform the following query:
SELECT "MMG", "Gene", COUNT(*) FROM at_summary_typing WHERE "MMG" != '' GROUP BY "MMG", "Gene" ORDER BY COUNT(*);
In the MySQL this returns in ~0.5s. However when I switched to PostgreSQL the same query takes ~3s. I have put an index on MMG and Gene together to try and speed it up but when using EXPLAIN (analyse, buffers, verbose) I see the output shows the index is not used :
Sort (cost=59013.54..59053.36 rows=15927 width=14) (actual time=2880.222..2885.475 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=16093 read=11482, temp read=2230 written=2230
-> GroupAggregate (cost=55915.50..57901.90 rows=15927 width=14) (actual time=2179.809..2861.679 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Sort (cost=55915.50..56372.29 rows=182713 width=14) (actual time=2179.782..2830.232 rows=180657 loops=1)
Output: "MMG", "Gene"
Sort Key: at_summary_typing."MMG", at_summary_typing."Gene"
Sort Method: external merge Disk: 8168kB
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Seq Scan on public.at_summary_typing (cost=0.00..36821.60 rows=182713 width=14) (actual time=0.010..224.658 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Buffers: shared hit=16093 read=11482
Total runtime: 2888.804 ms
After some searching I found that I could force the use of the index by setting SET enable_seqscan = OFF; and the EXPLAIN now shows the following :
Sort (cost=1181591.18..1181631.00 rows=15927 width=14) (actual time=555.546..560.839 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=173219 read=87094 written=7, temp read=411 written=411
-> GroupAggregate (cost=0.42..1180479.54 rows=15927 width=14) (actual time=247.546..533.202 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=173219 read=87094 written=7
-> Index Only Scan using mm_gene_idx on public.at_summary_typing (cost=0.42..1178949.93 rows=182713 width=14) (actual time=247.533..497.771 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Heap Fetches: 739728
Buffers: shared hit=173219 read=87094 written=7
Total runtime: 562.735 ms
Performance now comparable with the MySQL. The problem is that I understand that setting this is bad practice and that I should try and find a way to improve my query/encourage the use of the index automatically. However I'm very inexperienced with PostgreSQL and cannot work out how or why it is choosing to use a Seq Scan over an Index Scan in the first place.
''empty values?select version()will tell you)