I have a table, dev_base_low, that has several fields, but for this example we will focus on the key fields: from, to, carrier, type, v_type, goods_category. The table contains 4,000 records and is indexed on the key fields.
The objective is simple: using the type column, I want to join spoke - hub - spoke routes. My first attempt joins all spoke-hub legs to hub-hub legs with the following query:
WITH
temp_hub as
(SELECT * FROM dev_base_low WHERE carrier_type = 'hub')
SELECT
1
FROM
(select * from temp_hub where type = 'spoke-hub') leg1
INNER JOIN (select * from temp_hub where type = 'hub-hub') leg2
on
leg1.to = leg2.from and
leg1.carrier = leg2.carrier and
leg1.from <> leg2.to and
leg1.v_type = leg2.v_type and
leg1.goods_category = leg2.goods_category
As expected, the query runs quickly using a hash join, in well under 1 second (about 17 ms), and outputs close to 9,000 records (8,982). The EXPLAIN ANALYZE output is below. To my knowledge, indexes should not make a big difference here, given that it is a hash join and the total number of records is in no way large:
QUERY PLAN
Hash Join (cost=224.21..308.71 rows=1 width=4) (actual time=6.369..15.949 rows=8982 loops=1)
Hash Cond: ((temp_hub.to = temp_hub_1.from) AND (temp_hub.carrier = temp_hub_1.carrier) AND (temp_hub.v_type = temp_hub_1.v_type) AND (temp_hub.goods_category = temp_hub_1.goods_category))
Join Filter: (temp_hub.from <> temp_hub_1.to)
Rows Removed by Join Filter: 4
CTE temp_hub
-> Seq Scan on dev_base_low (cost=0.00..139.72 rows=3738 width=155) (actual time=0.020..2.094 rows=3738 loops=1)
Filter: (carrier_type = 'hub'::text)
-> CTE Scan on temp_hub (cost=0.00..84.11 rows=19 width=160) (actual time=0.027..1.514 rows=1240 loops=1)
Filter: (type = 'spoke-hub'::text)
Rows Removed by Filter: 2498
-> Hash (cost=84.11..84.11 rows=19 width=160) (actual time=6.313..6.314 rows=1088 loops=1)
Buckets: 2048 (originally 1024) Batches: 1 (originally 1) Memory Usage: 107kB
-> CTE Scan on temp_hub temp_hub_1 (cost=0.00..84.11 rows=19 width=160) (actual time=0.011..5.106 rows=1088 loops=1)
Filter: (type = 'hub-hub'::text)
Rows Removed by Filter: 2650
Planning Time: 1.871 ms
Execution Time: 16.928 ms
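One thing the plan above does show is a gap between estimated and actual row counts: each CTE scan is estimated at rows=19, while the actual counts are 1,240 and 1,088. A minimal sketch for checking the real size of each leg, using only the table and columns from the query above:

```sql
-- Count the rows behind each leg so they can be compared with the
-- planner's estimates (rows=19 per CTE scan in the plan above).
SELECT type, count(*) AS leg_rows
FROM dev_base_low
WHERE carrier_type = 'hub'
GROUP BY type
ORDER BY type;
```

If the counts differ from the estimates this much, the planner is presumably costing its join strategies for far smaller inputs than it actually receives.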
The problem occurs when a second self-join is added: the query times out. Unfortunately, I have a hard client-side timeout of 20 seconds. The query is shown below:
SELECT
1
FROM
(select * from temp_hub where type = 'spoke-hub') leg1
INNER JOIN (select * from temp_hub where type = 'hub-hub') leg2
on
leg1.to = leg2.from and
leg1.carrier = leg2.carrier and
leg1.from <> leg2.to and
leg1.v_type = leg2.v_type and
leg1.goods_category = leg2.goods_category
INNER JOIN (select * from temp_hub where type = 'hub-spoke' ) leg3
on leg2.to = leg3.from and
leg2.carrier = leg3.carrier and
leg1.from <> leg3.to and
leg2.from <> leg3.to and
leg2.v_type = leg3.v_type and
leg2.goods_category = leg3.goods_category
I have tried several optimizations: indexing, using sub-queries and CTEs, forcing different join methods (hash, nested loop, merge), and checking the DB configuration for memory allocation, but with no real benefit. I estimate the total output with the second inner join to be under 400,000 records.
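For reference, the join-method experiments can be sketched with PostgreSQL's per-session planner settings. This is only an illustration of the approach, not a recommended fix: the work_mem value is an assumption, and the enable_* settings discourage a join method rather than forbid it. Plain EXPLAIN is used so the statement returns a plan without running the slow query.

```sql
BEGIN;

-- Applies to this transaction only; reverts at ROLLBACK/COMMIT.
SET LOCAL enable_nestloop = off;   -- steer the planner away from nested loops
SET LOCAL work_mem = '64MB';       -- assumed value, gives hash tables more room

-- "to" and "from" are reserved words, so they are quoted here.
EXPLAIN
WITH temp_hub AS (
    SELECT * FROM dev_base_low WHERE carrier_type = 'hub'
)
SELECT 1
FROM (SELECT * FROM temp_hub WHERE type = 'spoke-hub') leg1
INNER JOIN (SELECT * FROM temp_hub WHERE type = 'hub-hub') leg2
    ON leg1."to" = leg2."from"
   AND leg1.carrier = leg2.carrier
   AND leg1."from" <> leg2."to"
   AND leg1.v_type = leg2.v_type
   AND leg1.goods_category = leg2.goods_category
INNER JOIN (SELECT * FROM temp_hub WHERE type = 'hub-spoke') leg3
    ON leg2."to" = leg3."from"
   AND leg2.carrier = leg3.carrier
   AND leg1."from" <> leg3."to"
   AND leg2."from" <> leg3."to"
   AND leg2.v_type = leg3.v_type
   AND leg2.goods_category = leg3.goods_category;

ROLLBACK;
```

Comparing the plan with and without the SET LOCAL lines shows whether the planner actually switched join methods.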
My questions:
- Is there anything wrong with the query joins or methods?
- Are there any optimizations or troubleshooting steps for query performance that I can try?
- Even given 2 tables, with 8,000 and 4,000 records respectively, is there anything I can do to ensure that the runtime remains below 20 seconds?
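On the last question, one approach sometimes used when CTE scans are badly estimated is to materialize each leg into its own temporary table and run ANALYZE on it, so the planner works from real statistics instead of default guesses. The sketch below assumes this is acceptable in the session. The temp table names leg1, leg2 and leg3 are hypothetical, and whether this brings the runtime under 20 seconds is untested.

```sql
-- One temp table per leg (hypothetical names leg1, leg2, leg3).
CREATE TEMP TABLE leg1 AS
    SELECT * FROM dev_base_low
    WHERE carrier_type = 'hub' AND type = 'spoke-hub';

CREATE TEMP TABLE leg2 AS
    SELECT * FROM dev_base_low
    WHERE carrier_type = 'hub' AND type = 'hub-hub';

CREATE TEMP TABLE leg3 AS
    SELECT * FROM dev_base_low
    WHERE carrier_type = 'hub' AND type = 'hub-spoke';

-- Autovacuum does not process temp tables, so collect statistics manually.
ANALYZE leg1;
ANALYZE leg2;
ANALYZE leg3;

-- Same join conditions as the three-leg query; check the plan before running it.
EXPLAIN
SELECT 1
FROM leg1
INNER JOIN leg2
    ON leg1."to" = leg2."from"
   AND leg1.carrier = leg2.carrier
   AND leg1."from" <> leg2."to"
   AND leg1.v_type = leg2.v_type
   AND leg1.goods_category = leg2.goods_category
INNER JOIN leg3
    ON leg2."to" = leg3."from"
   AND leg2.carrier = leg3.carrier
   AND leg1."from" <> leg3."to"
   AND leg2."from" <> leg3."to"
   AND leg2.v_type = leg3.v_type
   AND leg2.goods_category = leg3.goods_category;
```

If the estimated row counts in the resulting plan are close to the real leg sizes, dropping the EXPLAIN keyword runs the query itself.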
EDIT
Even after increasing the client timeout, the query still times out at 2 minutes. I know I could increase the timeout further, but I guess that defeats the purpose. I have added the EXPLAIN output for the query below:
Nested Loop (cost=224.16..393.19 rows=1 width=4)
Join Filter: ((temp_hub."from" <> temp_hub_1."to") AND (temp_hub_1."from" <> temp_hub_2."to") AND (temp_hub."to" = temp_hub_1."from") AND (temp_hub.carrier = temp_hub_1.carrier) AND (temp_hub.v_type = temp_hub_1.v_type) AND (temp_hub.goods_category = temp_hub_1.goods_category) AND (temp_hub_2."from" = temp_hub_1."to"))
CTE temp_hub
-> Seq Scan on dev_base_low (cost=0.00..139.72 rows=3738 width=155)
Filter: (carrier_type = 'hub'::text)
-> Hash Join (cost=84.44..168.84 rows=1 width=320)
Hash Cond: ((temp_hub.carrier = temp_hub_2.carrier) AND (temp_hub.v_type = temp_hub_2.v_type) AND (temp_hub.goods_category = temp_hub_2.goods_category))
Join Filter: (temp_hub."from" <> temp_hub_2."to")
-> CTE Scan on temp_hub (cost=0.00..84.11 rows=19 width=160)
Filter: (type = 'spoke-hub'::text)
-> Hash (cost=84.11..84.11 rows=19 width=160)
-> CTE Scan on temp_hub temp_hub_2 (cost=0.00..84.11 rows=19 width=160)
Filter: (type = 'hub-spoke'::text)
-> CTE Scan on temp_hub temp_hub_1 (cost=0.00..84.11 rows=19 width=160)
Filter: (type = 'hub-hub'::text)
... (<>) in a WHERE clause, not the join condition, but I have no idea whether that'll make a difference for planning.
... (type, carrier, goods_category, to) include (from) and another (type, carrier, goods_category, from) include (to) would sort this out.