0

I have a table, dev_base_low which stores that have several fields but for this example we will focus on the key fields, from, to, carrier, type, v_type, goods_category. The table contains 4,000 records and are indexed on the key fields.

The objective is simple, under the column type I want to join the spokes - hub - spokes. My first attempt is to join all spokes to hubs with the following:

WITH 
    temp_hub as 
    (SELECT * FROM dev_base_low WHERE carrier_type = 'hub')

SELECT
    1
FROM 
  (select * from temp_hub where type = 'spoke-hub') leg1 
    INNER JOIN (select * from temp_hub where type = 'hub-hub') leg2 
        on  
                        leg1.to = leg2.from and
                        leg1.carrier = leg2.carrier and
                        leg1.from <> leg2.to and
                        leg1.v_type = leg2.v_type and
                        leg1.goods_category = leg2.goods_category 

The query as we can imagine runs optimally using hash joins under a 1 second, which outputs 8,000 records. EXPLAIN ANALYZE output, to my knowledge from below, indexes should not have a big difference given that it is a hash join and also because the total records are in no ways large:

QUERY PLAN
Hash Join  (cost=224.21..308.71 rows=1 width=4) (actual time=6.369..15.949 rows=8982 loops=1)
  Hash Cond: ((temp_hub.to = temp_hub_1.from) AND (temp_hub.carrier = temp_hub_1.carrier) AND (temp_hub.v_type = temp_hub_1.v_type) AND (temp_hub.goods_category = temp_hub_1.goods_category))
  Join Filter: (temp_hub.from <> temp_hub_1.to)
  Rows Removed by Join Filter: 4
  CTE temp_hub
    ->  Seq Scan on dev_base_low  (cost=0.00..139.72 rows=3738 width=155) (actual time=0.020..2.094 rows=3738 loops=1)
          Filter: (carrier_type = 'hub'::text)
  ->  CTE Scan on temp_hub  (cost=0.00..84.11 rows=19 width=160) (actual time=0.027..1.514 rows=1240 loops=1)
        Filter: (type = 'spoke-hub'::text)
        Rows Removed by Filter: 2498
  ->  Hash  (cost=84.11..84.11 rows=19 width=160) (actual time=6.313..6.314 rows=1088 loops=1)
        Buckets: 2048 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 107kB
        ->  CTE Scan on temp_hub temp_hub_1  (cost=0.00..84.11 rows=19 width=160) (actual time=0.011..5.106 rows=1088 loops=1)
              Filter: (type = 'hub-hub'::text)
              Rows Removed by Filter: 2650
Planning Time: 1.871 ms
Execution Time: 16.928 ms

The problem occurs when a second self-inner-join is added and the query times out, unfortunately, I have a hard timeout on my client-side at 20 seconds. Shown below:

SELECT
    1
FROM 
  (select * from temp_hub where type = 'spoke-hub') leg1 
    INNER JOIN (select * from temp_hub where type = 'hub-hub') leg2 
        on  
                        leg1.to = leg2.from and
                        leg1.carrier = leg2.carrier and
                        leg1.from <> leg2.to and
                        leg1.v_type = leg2.v_type and
                        leg1.goods_category = leg2.goods_category 
  INNER JOIN (select * from temp_hub where type = 'hub-spoke' ) leg3
        on  leg2.to = leg3.from and
                        leg2.carrier = leg3.carrier and
                        leg1.from <> leg3.to and
                        leg2.from <> leg3.to and
                        leg2.v_type = leg3.v_type and
                        leg2.goods_category = leg3.goods_category

I have tried several optimizations with indexing, using sub-queries and CTEs, using different join methods (hash,nested,merge) and checking DB configs on memory allocation but with no real benefits. I have estimated the total output with the 2nd inner join to be under 400,000 records.

My questions:

  1. Is there anything wrong with the query joins or methods?
  2. Are there any optimizations on troubleshooting query performance that I can run?
  3. Even given 2 tables, with 8,000 and 4,000 records respectively, is there anything I can do to ensure that the runtime remains below 20 seconds?

EDIT

So even after increasing timeout, it timesout at 2 minutes, I know there is an option to increase timeout but I guess that defeats the purpose. Added the EXPLAIN for the query below:

Nested Loop  (cost=224.16..393.19 rows=1 width=4)
"  Join Filter: ((temp_hub.""from"" <> temp_hub_1.""to"") AND (temp_hub_1.""from"" <> temp_hub_2.""to"") AND (temp_hub.""to"" = temp_hub_1.""from"") AND (temp_hub.carrier = temp_hub_1.carrier) AND (temp_hub.v_type = temp_hub_1.v_type) AND (temp_hub.goods_category = temp_hub_1.goods_category) AND (temp_hub_2.""from"" = temp_hub_1.""to""))"
  CTE temp_hub
    ->  Seq Scan on dev_base_low  (cost=0.00..139.72 rows=3738 width=155)
          Filter: (carrier_type = 'hub'::text)
  ->  Hash Join  (cost=84.44..168.84 rows=1 width=320)
        Hash Cond: ((temp_hub.carrier = temp_hub_2.carrier) AND (temp_hub.v_type = temp_hub_2.v_type) AND (temp_hub.goods_category = temp_hub_2.goods_category))
"        Join Filter: (temp_hub.""from"" <> temp_hub_2.""to"")"
        ->  CTE Scan on temp_hub  (cost=0.00..84.11 rows=19 width=160)
              Filter: (type = 'spoke-hub'::text)
        ->  Hash  (cost=84.11..84.11 rows=19 width=160)
              ->  CTE Scan on temp_hub temp_hub_2  (cost=0.00..84.11 rows=19 width=160)
                    Filter: (type = 'hub-spoke'::text)
  ->  CTE Scan on temp_hub temp_hub_1  (cost=0.00..84.11 rows=19 width=160)
        Filter: (type = 'hub-hub'::text)
8
  • I'd put some of those conditions (in particular the <>) in a where clause not the join condition, but I have no idea whether that'll make a difference for planning Commented Feb 16, 2024 at 10:25
  • Please update your question with the EXPLAIN output for the second query. Obviously don't use the ANALYZE option so that the query is not executed Commented Feb 16, 2024 at 10:35
  • Will update the EXPLAIN output, but are there any usual suspects? Commented Feb 16, 2024 at 10:41
  • I would imagine two indexes (type, carrier, goods_category, to) include (from) and another (type, carrier, goods_category, from) include (to) would sort this out. Commented Feb 16, 2024 at 11:13
  • @Charlieface would you think type is required, I know it’s in the where clause but it’s small. Commented Feb 16, 2024 at 11:23

1 Answer 1

0

After a small change, the query now runs under a second.

Nested Loop  (cost=0.58..1804.05 rows=43262 width=4) (actual time=0.235..926.969 rows=310362 loops=1)
"  Join Filter: ((dev_base.""from"" <> dev_base_2.""to"") AND (dev_base.carrier = dev_base_2.carrier) AND (dev_base.v_type = dev_base_2.v_type) AND (dev_base.goods_category = dev_base_2.goods_category))"
  Rows Removed by Join Filter: 2564
  ->  Nested Loop  (cost=0.29..941.32 rows=7986 width=93) (actual time=0.194..493.947 rows=23818 loops=1)
        ->  Seq Scan on dev_base  (cost=0.00..297.30 rows=2031 width=53) (actual time=0.128..4.099 rows=1778 loops=1)
              Filter: ((carrier_type = 'hub'::text) AND (type = 'spoke-hub'::text))
              Rows Removed by Filter: 5842
        ->  Memoize  (cost=0.29..0.82 rows=1 width=53) (actual time=0.039..0.269 rows=13 loops=1778)
"              Cache Key: dev_base.""from"", dev_base.""to"", dev_base.carrier, dev_base.v_type, dev_base.goods_category"
              Hits: 0  Misses: 1778  Evictions: 0  Overflows: 0  Memory Usage: 2164kB
              ->  Index Scan using idx_base_routes on dev_base dev_base_1  (cost=0.28..0.81 rows=1 width=53) (actual time=0.035..0.253 rows=13 loops=1778)
"                    Index Cond: ((""from"" = dev_base.""to"") AND (carrier = dev_base.carrier) AND (v_type = dev_base.v_type) AND (goods_category = dev_base.goods_category))"
"                    Filter: ((carrier_type = 'hub'::text) AND (type = 'hub-hub'::text) AND (dev_base.""from"" <> ""to""))"
                    Rows Removed by Filter: 36
  ->  Memoize  (cost=0.29..0.88 rows=1 width=53) (actual time=0.004..0.009 rows=13 loops=23818)
"        Cache Key: dev_base_1.""from"", dev_base_1.carrier, dev_base_1.v_type, dev_base_1.goods_category, dev_base_1.""to"""
        Hits: 22644  Misses: 1174  Evictions: 0  Overflows: 0  Memory Usage: 1070kB
        ->  Index Scan using idx_base_routes on dev_base dev_base_2  (cost=0.28..0.87 rows=1 width=53) (actual time=0.049..0.127 rows=10 loops=1174)
"              Index Cond: ((""from"" = dev_base_1.""to"") AND (carrier = dev_base_1.carrier) AND (v_type = dev_base_1.v_type) AND (goods_category = dev_base_1.goods_category))"
"              Filter: ((carrier_type = 'hub'::text) AND (type = 'hub-spoke'::text) AND (dev_base_1.""from"" <> ""to""))"
              Rows Removed by Filter: 11
Planning Time: 3.428 ms
Execution Time: 943.901 ms

In summary, indexes are VERY important during joins and in Postgres, using a CTE with the WITH statement might materialize and not have the benefits of its index. For join cases like mine, it is much more effecient if it is written as;

WITH
    temp_hub AS NOT MATERIALIZED
    (SELECT * FROM dev_base WHERE carrier_type = 'hub')

This is also recommended under docs: https://www.postgresql.org/docs/current/queries-with.html with the caveat that your WITH statement does not contain any computationally expensive operations.

The comment from @Charlieface pointed this to the right direction.

Sign up to request clarification or add additional context in comments.

1 Comment

In your answer, you reference dev_base, but in the original post the table was dev_base_low. Are these the same tables? Also, please include the improved query instead of just stating that a small change to the original query resolved the performance issue. For the query presented, I would eliminate the CTE and subqueries and instead select directly against dev_base with carrier_type = 'hub' added to the WHERE clause and adding a comparison to leg1.carrier_type to each of the self-join expressions.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.