I have a query I'm trying to optimize and am running into some surprising/perplexing results.
The tables I'm working with are features and areas, each of which has its own id and geometry columns.
Table "features"
Column | Type | Collation | Nullable | Default
-------------+----------+-----------+----------+---------
id | bigint | | not null |
category | text | | not null |
geom | geometry | | not null |
Indexes:
"features_pkey" PRIMARY KEY, btree (id)
"features_category_idx" btree (category)
Table "areas"
Column | Type | Collation | Nullable | Default
-------------+----------+-----------+----------+---------
id | bigint | | not null |
geom | geometry | | not null |
Indexes:
"features_pkey" PRIMARY KEY, btree (id)
The next table stores the many-to-many relationship between features and areas, with foreign key constraints. Each feature may be in zero, one, or many areas (if it's in no areas, it has no entries in the feature_area table), and each area has many features.
Table "feature_area"
Column | Type | Collation | Nullable | Default
--------------+----------+-----------+----------+---------
feature_id | bigint | | not null |
area_id | bigint | | not null |
category | text | | |
Indexes:
"feature_area_pkey" PRIMARY KEY, btree (feature_id, area_id)
"feature_area_category_idx" btree (category)
Foreign-key constraints:
"feature_area_feature_id_fkey" FOREIGN KEY (feature_id) REFERENCES features(feature_id)
"feature_area_area_id_fkey" FOREIGN KEY (area_id) REFERENCES areas(area_id)
What I'm trying to get to is a result like this: all features of category type_x that fall within at least one area:
feature_id | areas | geom
---------------+-------------+-------------
1 | {45,123} | xxxxxx
3 | {8} | xxxxxx
Here's the query I'm working on. It's very slow (~35 seconds).
-- QUERY 1
WITH area_type_x AS (
SELECT
feature_id,
array_agg(area_id) AS areas
FROM feature_area
WHERE category = 'long name for type x'
GROUP BY feature_id
)
SELECT
features.id AS feature_id,
features.geom,
area_type_x.areas
FROM area_type_x
JOIN features ON features.id = area_type_x.feature_id;
By chance, I tried this, and it's much faster (<3 seconds).
-- QUERY 2
WITH area_type_x AS (
SELECT
feature_id,
array_agg(area_id) AS areas
FROM feature_area
WHERE short_name(category) = 'type_x' -- this line is the only difference
GROUP BY feature_id
)
SELECT
features.id AS feature_id,
features.geom,
area_type_x.areas
FROM area_type_x
JOIN features ON features.id = area_type_x.feature_id;
I ran each with EXPLAIN ANALYZE and can share those results if it's helpful, but haven't been able to make sense of them myself.
Any idea what's going on? I'd like to figure it out, because I suspect I can do better than 3 seconds if I can skip converting category to its short version while keeping whatever improvement that conversion is giving me.
EDIT/UPDATE:
I have some additional information. Based on a handful of articles and SO questions, I temporarily set enable_seqscan = off, which drastically changed the runtime: Query 1 now takes slightly less time than Query 2, which makes more sense to me. The difference shows up in the query plan.
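For anyone wanting to reproduce the experiment, the toggle is just a standard session-level PostgreSQL setting (no restart needed), and it only affects the current session:

```sql
-- Disable sequential scans for this session only
SET enable_seqscan = off;

EXPLAIN ANALYZE
-- ... Query 1 goes here ...
SELECT 1;  -- placeholder

-- Restore the default afterwards
RESET enable_seqscan;
```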
Query 1, with enable_seqscan=on, takes >30 seconds:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=114638.13..481900.52 rows=881850 width=169) (actual time=7136.629..33748.990 rows=884251 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=113638.13..392715.52 rows=367438 width=169) (actual time=7172.678..32326.371 rows=294750 loops=3)
Hash Cond: (features.id = area_type_x.feature_id)
-> Parallel Seq Scan on features (cost=0.00..159192.80 rows=2589180 width=137) (actual time=12.512..778.814 rows=2057905 loops=3)
-> Hash (cost=95725.00..95725.00 rows=881850 width=40) (actual time=7061.588..7061.591 rows=884251 loops=3)
Buckets: 131072 Batches: 16 Memory Usage: 4915kB
-> Subquery Scan on area_type_x (cost=0.43..95725.00 rows=881850 width=40) (actual time=116.879..1422.959 rows=884251 loops=3)
-> GroupAggregate (cost=0.43..86906.50 rows=881850 width=40) (actual time=116.877..1333.640 rows=884251 loops=3)
Group Key: feature_area.feature_id
-> Index Scan using feature_area_pkey on feature_area (cost=0.43..71341.84 rows=908307 width=16) (actual time=116.808..803.575 rows=905628 loops=3)
Filter: (category = 'long name for type x'::text)
Rows Removed by Filter: 228763
Planning Time: 0.577 ms
JIT:
Functions: 48
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 10.385 ms, Inlining 0.000 ms, Optimization 2.417 ms, Emission 33.150 ms, Total 45.952 ms
Execution Time: 33796.604 ms
(20 rows)
Query 1, with enable_seqscan=off, takes roughly 3 sec:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=0.86..860907.04 rows=881850 width=169) (actual time=1288.916..3362.281 rows=884251 loops=1)
Merge Cond: (feature_area.feature_id = features.id)
-> GroupAggregate (cost=0.43..86906.50 rows=881850 width=40) (actual time=179.721..865.900 rows=884251 loops=1)
Group Key: feature_area.feature_id
-> Index Scan using feature_area_pkey on feature_area (cost=0.43..71341.84 rows=908307 width=16) (actual time=179.686..550.355 rows=905628 loops=1)
Filter: (category = 'long name for type x'::text)
Rows Removed by Filter: 228763
-> Index Scan using features_pkey on features (cost=0.43..738623.83 rows=6214031 width=137) (actual time=0.095..2033.808 rows=5979584 loops=1)
Planning Time: 0.577 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.839 ms, Inlining 10.214 ms, Optimization 67.773 ms, Emission 38.734 ms, Total 118.561 ms
Execution Time: 3389.080 ms
(14 rows)
Turning enable_seqscan off for a production database clearly isn't an option. But I'm not sure where to go from here.
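Two directions I'm considering (untested on my data, so treat these as sketches): scoping the override to a single transaction with SET LOCAL so it can't leak into other queries, or nudging the planner's cost model with a lower random_page_cost instead of forbidding seq scans outright. Both are standard PostgreSQL settings; the database name below is a placeholder.

```sql
-- Option A: confine the override to one transaction.
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
BEGIN;
SET LOCAL enable_seqscan = off;
-- ... run Query 1 here ...
COMMIT;

-- Option B: tell the planner random page reads are cheaper,
-- which makes index scans more attractive in general.
-- The default random_page_cost is 4.0; SSD-backed setups often use ~1.1.
ALTER DATABASE mydb SET random_page_cost = 1.1;  -- 'mydb' is a placeholder name
```

Option B changes plan choices database-wide, so it would need benchmarking against other workloads before going to production.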