
I'm using Postgres to run this simple query on a 1 TB TPC-DS database:

select cp_start_date_sk, ws_sold_date_sk
from catalog_page, web_sales
where ws_sold_date_sk = cp_start_date_sk;

the query plan is:

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=34981958.72..381924075.14 rows=24611155542 width=8)
   Hash Cond: (catalog_page.cp_start_date_sk = web_sales.ws_sold_date_sk)
   ->  Seq Scan on catalog_page  (cost=0.00..1836.00 rows=60000 width=4)
   ->  Hash  (cost=25981508.32..25981508.32 rows=720036032 width=4)
         ->  Seq Scan on web_sales  (cost=0.00..25981508.32 rows=720036032 width=4)

As can be seen, the big table is used to build the hash. Allegedly this is not optimal, because the hash table will be bigger and building it will take longer than building the hash from the small table. Can anyone explain this?
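One way to probe the planner's choice is to disable hash joins for the session and compare the cost of its next-best plan (a sketch using standard planner toggles; `enable_hashjoin` only discourages the strategy, it doesn't change which side is hashed):

```sql
-- Baseline plan (hash join, big table on the build side):
EXPLAIN
SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page, web_sales
WHERE ws_sold_date_sk = cp_start_date_sk;

-- See what the planner falls back to, and at what estimated cost:
SET enable_hashjoin = off;  -- session-local planner toggle
EXPLAIN
SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page, web_sales
WHERE ws_sold_date_sk = cp_start_date_sk;
RESET enable_hashjoin;
```

If the alternative plan's estimated cost is higher, the planner considered both and genuinely preferred the plan shown above.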

update: the schema: [table definition screenshots not reproduced here]

update 2: PostgreSQL version: psql (PostgreSQL) 14beta1

config definitions:

max_worker_processes = 1

max_parallel_workers = 1

max_parallel_workers_per_gather = 1

shared_buffers = 125GB

effective_cache_size = 250GB

work_mem = 125GB
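For reference, the effective values can be confirmed from a running psql session with the standard `SHOW` command (values below are what the config above should yield):

```sql
SHOW max_parallel_workers_per_gather;
SHOW shared_buffers;
SHOW work_mem;  -- note: work_mem is a per-hash/per-sort limit, not a global pool
```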
  • can you add the create table + create index statements? The indexes aren't being used at all ... Commented Jul 12, 2021 at 14:35
  • Nothing is "inserted into" - the hash table used for the join is built up in memory. If there isn't enough memory to do that, Postgres will switch to a different join strategy. Commented Jul 12, 2021 at 15:05
  • Totally unrelated, but: you might want to start using the "modern" explicit JOIN operator that was introduced in SQL nearly 30 years ago, rather than the implicit and fragile join conditions in the WHERE clause Commented Jul 12, 2021 at 15:06
  • You need to show us the table and index definitions, as well as row counts for each of the tables. Maybe your tables are defined poorly. Maybe the indexes aren't created correctly. Maybe you don't have an index on that column you thought you did. Without seeing the table and index definitions, we can't tell. We need row counts because that can affect query planning. If you know how to do an EXPLAIN or get an execution plan, put the results in the question as well. If you have no indexes, visit use-the-index-luke.com. Commented Jul 12, 2021 at 15:08
  • I created the tables with the "tpcds.sql" file and the constraints with the "tpcds_ri.sql" file that come with the TPC-DS kit here: github.com/gregrahn/tpcds-kit/tree/master/tools. The schema of the tables is in the updated question. Commented Jul 12, 2021 at 15:39

1 Answer


There are no indexes on the columns used in your query. Add the following indexes and the query time should improve (check the query plan again):

CREATE INDEX idx_cat ON catalog_page (cp_start_date_sk);
CREATE INDEX idx_ws ON web_sales (ws_sold_date_sk);

Also, try to use proper JOINs instead of listing multiple tables in the WHERE clause - the latter can give you a huge headache if you accidentally cross join multiple large tables.

SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page c 
JOIN web_sales w ON w.ws_sold_date_sk = c.cp_start_date_sk;
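After creating the indexes, it's worth refreshing the planner's statistics before re-checking the plan (standard Postgres commands; note that `EXPLAIN ANALYZE` actually executes the query, so on 1 TB it will take a while):

```sql
-- Refresh statistics so the planner sees the new indexes and current row counts:
ANALYZE catalog_page;
ANALYZE web_sales;

-- Re-check the plan; EXPLAIN alone is cheap, EXPLAIN ANALYZE runs the query:
EXPLAIN
SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page c
JOIN web_sales w ON w.ws_sold_date_sk = c.cp_start_date_sk;
```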

Demo: db<>fiddle


2 Comments

Actually, I checked the explicit join, and even swapped the order of the tables, and always got the same plan. I didn't try indexes because I'm not trying to fix the problem, but to understand why Postgres produces such an inefficient plan.
In fact, a supporting index is needed for all FK fields (otherwise, updates or deletes in the referenced table would be very costly). Plus: run ANALYZE on the table after creating the index.
