
I have the following query:

SELECT
   Sum(fact_individual_re.quality_hours) AS C0,
   dim_gender.name AS C1,
   dim_date.year AS C2
FROM
   fact_individual_re
   INNER JOIN dim_date ON fact_individual_re.dim_date_id = dim_date.id
   INNER JOIN dim_gender ON fact_individual_re.dim_gender_id = dim_gender.id
GROUP BY dim_date.year,dim_gender.name
ORDER BY dim_date.year ASC,dim_gender.name ASC,Sum(fact_individual_re.quality_hours) ASC

When explaining its plan, the HASH JOIN is taking most of the time. Is there any way to minimize the time spent on the HASH JOIN? Here is the plan:

Sort  (cost=190370.50..190370.55 rows=20 width=18) (actual time=4005.152..4005.154 rows=20 loops=1)
   Sort Key: dim_date.year, dim_gender.name, (sum(fact_individual_re.quality_hours))
   Sort Method: quicksort  Memory: 26kB
   ->  Finalize GroupAggregate  (cost=190369.07..190370.07 rows=20 width=18) (actual time=4005.106..4005.135 rows=20 loops=1)
         Group Key: dim_date.year, dim_gender.name
         ->  Sort  (cost=190369.07..190369.27 rows=80 width=18) (actual time=4005.100..4005.103 rows=100 loops=1)
               Sort Key: dim_date.year, dim_gender.name
               Sort Method: quicksort  Memory: 32kB
               ->  Gather  (cost=190358.34..190366.54 rows=80 width=18) (actual time=4004.966..4005.020 rows=100 loops=1)
                     Workers Planned: 4
                     Workers Launched: 4
                     ->  Partial HashAggregate  (cost=189358.34..189358.54 rows=20 width=18) (actual time=3885.254..3885.259 rows=20 loops=5)
                           Group Key: dim_date.year, dim_gender.name
                           ->  Hash Join  (cost=125.17..170608.34 rows=2500000 width=14) (actual time=2.279..2865.808 rows=2000000 loops=5)
                                 Hash Cond: (fact_individual_re.dim_gender_id = dim_gender.id)
                                 ->  Hash Join  (cost=124.13..150138.54 rows=2500000 width=12) (actual time=2.060..2115.234 rows=2000000 loops=5)
                                       Hash Cond: (fact_individual_re.dim_date_id = dim_date.id)
                                       ->  Parallel Seq Scan on fact_individual_re  (cost=0.00..118458.00 rows=2500000 width=12) (actual time=0.204..982.810 rows=2000000 loops=5)
                                       ->  Hash  (cost=78.50..78.50 rows=3650 width=8) (actual time=1.824..1.824 rows=3650 loops=5)
                                             Buckets: 4096  Batches: 1  Memory Usage: 175kB
                                             ->  Seq Scan on dim_date  (cost=0.00..78.50 rows=3650 width=8) (actual time=0.143..1.030 rows=3650 loops=5)
                                 ->  Hash  (cost=1.02..1.02 rows=2 width=10) (actual time=0.193..0.193 rows=2 loops=5)
                                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                       ->  Seq Scan on dim_gender  (cost=0.00..1.02 rows=2 width=10) (actual time=0.181..0.182 rows=2 loops=5)
 Planning time: 0.609 ms
 Execution time: 4020.423 ms
(26 rows)

I am using PostgreSQL v10.

  • Joins from fact tables to dimension tables would not normally use a hash join, because proper indexes would be set up. Commented Jan 22, 2018 at 16:15
  • I already have indices on dim_date.id and dim_gender.id Commented Jan 22, 2018 at 16:23
  • I find the hash join puzzling. Commented Jan 23, 2018 at 2:07

2 Answers


I'd recommend partially grouping the rows before the join:

select
  sum(quality_hours_sum) AS C0,
  dim_gender.name AS C1,
  dim_date.year AS C2
from 
  (
    select
      sum(quality_hours) as quality_hours_sum,
      dim_date_id,
      dim_gender_id
    from fact_individual_re
    group by dim_date_id, dim_gender_id
  ) as fact_individual_re_sum
  join dim_date on dim_date_id = dim_date.id
  join dim_gender on dim_gender_id = dim_gender.id
group by dim_date.year, dim_gender.name
order by dim_date.year, dim_gender.name, sum(quality_hours_sum);

This way you will be joining only 1460 rows (count(distinct dim_date_id) * count(distinct dim_gender_id)) instead of all 2M rows. It would still need to read and group all 2M rows, though - to avoid that you'd need something like a summary table maintained with a trigger.
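As a rough sketch of that summary-table idea (not from the original answer; the table name, key types, and quality_hours type are assumptions, and it only handles INSERTs - UPDATEs and DELETEs on the fact table would need similar triggers):

CREATE TABLE fact_individual_re_sum (
  dim_date_id       int NOT NULL,
  dim_gender_id     int NOT NULL,
  quality_hours_sum numeric NOT NULL DEFAULT 0,
  PRIMARY KEY (dim_date_id, dim_gender_id)
);

CREATE FUNCTION maintain_fact_individual_re_sum() RETURNS trigger AS $$
BEGIN
  -- Fold the new fact row into the matching summary row, creating it if needed.
  INSERT INTO fact_individual_re_sum (dim_date_id, dim_gender_id, quality_hours_sum)
  VALUES (NEW.dim_date_id, NEW.dim_gender_id, NEW.quality_hours)
  ON CONFLICT (dim_date_id, dim_gender_id)
  DO UPDATE SET quality_hours_sum =
    fact_individual_re_sum.quality_hours_sum + EXCLUDED.quality_hours_sum;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_maintain_fact_individual_re_sum
AFTER INSERT ON fact_individual_re
FOR EACH ROW EXECUTE PROCEDURE maintain_fact_individual_re_sum();

The reporting query would then aggregate fact_individual_re_sum (a few thousand rows at most) instead of scanning the 2M-row fact table.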


There is no predicate shown on the fact table, so we can assume that, prior to any filtering via the joins, 100% of that table is required.

The indexes exist on the lookup tables, but from what you say they are not covering indexes. Given that 100% of the fact table is being scanned, combined with the indexes not being covering, I would expect it to hash join.

As an experiment, you could apply a covering index (index dim_date.id and dim_date.year in a single index) to see if it swaps off a hash join against dim_date.
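For what it's worth, on PostgreSQL 10 a "covering" index here is simply a multicolumn index (the INCLUDE clause only arrived in v11); the index name below is made up:

-- Lets the join read id and year from the index alone; the same pattern would apply to dim_gender (id, name).
CREATE INDEX dim_date_id_year_idx ON dim_date (id, year);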

With the overall lack of predicates, though, there is little to try beyond a covering index, and a hash join is not necessarily the wrong query plan.

2 Comments

I created a covering index, but it had no impact on the query plan. I agree with your last statement that a hash join is not necessarily the wrong plan given the lack of predicates. My question is: can't we do anything to improve the speed of the hash join?
If a covering index is not being chosen for the plan, then I suspect not. With a covering index it should at least hash join against the index instead of the table, which would make building the hash table quicker to read; but since it's a lookup table, it's likely small enough not to matter.
