I have a query I'm trying to optimize and am running into some surprising/perplexing results.
The tables I'm working with are features and areas, each of which has its own id and geometry columns.
Table "features"
Column | Type | Collation | Nullable | Default
-------------+----------+-----------+----------+---------
id | bigint | | not null |
category | text | | not null |
geom | geometry | | not null |
Indexes:
"features_pkey" PRIMARY KEY, btree (id)
"features_category_idx" btree (category)
Table "areas"
Column | Type | Collation | Nullable | Default
-------------+----------+-----------+----------+---------
id | bigint | | not null |
geom | geometry | | not null |
Indexes:
"features_pkey" PRIMARY KEY, btree (id)
The next table stores the many-to-many relationship between features and areas, with foreign key constraints. Each feature may be in zero, one, or many areas (if it's in no areas, it has no entries in the feature_area table), and each area has many features.
Table "feature_area"
Column | Type | Collation | Nullable | Default
--------------+----------+-----------+----------+---------
feature_id | bigint | | not null |
area_id | bigint | | not null |
category | text | | |
Indexes:
"feature_area_pkey" PRIMARY KEY, btree (feature_id, area_id)
"feature_area_category_idx" btree (category)
Foreign-key constraints:
"feature_area_feature_id_fkey" FOREIGN KEY (feature_id) REFERENCES features(feature_id)
"feature_area_area_id_fkey" FOREIGN KEY (area_id) REFERENCES areas(area_id)
What I'm trying to get to is a result like this: all features of category type_x that fall within at least one area:
feature_id | areas | geom
---------------+-------------+-------------
1 | {45,123} | xxxxxx
3 | {8} | xxxxxx
Here's the query I'm working on. It's very slow (~35 seconds).
-- QUERY 1
WITH area_type_x AS (
SELECT
feature_id,
array_agg(area_id) AS areas
FROM feature_area
WHERE category = 'long name for type x'
GROUP BY feature_id
)
SELECT
features.id AS feature_id,
features.geom,
area_type_x.areas
FROM area_type_x
JOIN features ON features.id = area_type_x.feature_id;
By chance, I tried this, and it's much faster (<3 seconds).
-- QUERY 2
WITH area_type_x AS (
SELECT
feature_id,
array_agg(area_id) AS areas
FROM feature_area
WHERE short_name(category) = 'type_x' -- this line is the only difference
GROUP BY feature_id
)
SELECT
features.id AS feature_id,
features.geom,
area_type_x.areas
FROM area_type_x
JOIN features ON features.id = area_type_x.feature_id;
I ran each with EXPLAIN ANALYZE and can share those results if it's helpful, but haven't been able to make sense of them myself.
Any idea what's going on? I'd like to figure it out, because I suspect I can do better than 3 seconds if I can skip converting category to its short version while keeping whatever improvement that conversion is giving me.
EDIT/UPDATE:
I have some additional information. Based on a handful of articles and SO questions, I temporarily set enable_seqscan = off, which drastically changed the runtime: Query 1 now takes slightly less time than Query 2, which makes more sense to me. The difference shows up in the query plan.
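For anyone wanting to reproduce the experiment, the toggle is just a standard session-level PostgreSQL setting (no restart needed), and it only affects the current session:

```sql
-- Disable sequential scans for this session only
SET enable_seqscan = off;

EXPLAIN ANALYZE
-- ... Query 1 goes here ...
SELECT 1;  -- placeholder

-- Restore the default afterwards
RESET enable_seqscan;
```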
Query 1, with enable_seqscan=on, takes >30 seconds:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=114638.13..481900.52 rows=881850 width=169) (actual time=7136.629..33748.990 rows=884251 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=113638.13..392715.52 rows=367438 width=169) (actual time=7172.678..32326.371 rows=294750 loops=3)
Hash Cond: (features.id = area_type_x.feature_id)
-> Parallel Seq Scan on features (cost=0.00..159192.80 rows=2589180 width=137) (actual time=12.512..778.814 rows=2057905 loops=3)
-> Hash (cost=95725.00..95725.00 rows=881850 width=40) (actual time=7061.588..7061.591 rows=884251 loops=3)
Buckets: 131072 Batches: 16 Memory Usage: 4915kB
-> Subquery Scan on area_type_x (cost=0.43..95725.00 rows=881850 width=40) (actual time=116.879..1422.959 rows=884251 loops=3)
-> GroupAggregate (cost=0.43..86906.50 rows=881850 width=40) (actual time=116.877..1333.640 rows=884251 loops=3)
Group Key: feature_area.feature_id
-> Index Scan using feature_area_pkey on feature_area (cost=0.43..71341.84 rows=908307 width=16) (actual time=116.808..803.575 rows=905628 loops=3)
Filter: (category = 'long name for type x'::text)
Rows Removed by Filter: 228763
Planning Time: 0.577 ms
JIT:
Functions: 48
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 10.385 ms, Inlining 0.000 ms, Optimization 2.417 ms, Emission 33.150 ms, Total 45.952 ms
Execution Time: 33796.604 ms
(20 rows)
Query 1, with enable_seqscan=off, takes roughly 3 sec:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=0.86..860907.04 rows=881850 width=169) (actual time=1288.916..3362.281 rows=884251 loops=1)
Merge Cond: (feature_area.feature_id = features.id)
-> GroupAggregate (cost=0.43..86906.50 rows=881850 width=40) (actual time=179.721..865.900 rows=884251 loops=1)
Group Key: feature_area.feature_id
-> Index Scan using feature_area_pkey on feature_area (cost=0.43..71341.84 rows=908307 width=16) (actual time=179.686..550.355 rows=905628 loops=1)
Filter: (category = 'long name for type x'::text)
Rows Removed by Filter: 228763
-> Index Scan using features_pkey on features (cost=0.43..738623.83 rows=6214031 width=137) (actual time=0.095..2033.808 rows=5979584 loops=1)
Planning Time: 0.577 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.839 ms, Inlining 10.214 ms, Optimization 67.773 ms, Emission 38.734 ms, Total 118.561 ms
Execution Time: 3389.080 ms
(14 rows)
Turning enable_seqscan off for a production database clearly isn't an option. But I'm not sure where to go from here.
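Two directions I'm considering (untested on my data, so treat these as sketches): scoping the override to a single transaction with SET LOCAL so it can't leak into other queries, or nudging the planner's cost model with a lower random_page_cost instead of forbidding seq scans outright. Both are standard PostgreSQL settings; the database name below is a placeholder.

```sql
-- Option A: confine the override to one transaction.
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL enable_seqscan = off;
-- ... run Query 1 here ...
COMMIT;

-- Option B: tell the planner random page reads are cheaper,
-- which makes index scans more attractive in general.
-- The default random_page_cost is 4.0; SSD-backed setups often use ~1.1.
ALTER DATABASE mydb SET random_page_cost = 1.1;  -- 'mydb' is a placeholder name
```

Option B changes plan choices database-wide, so it would need benchmarking against other workloads before going to production.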