I'm using PostgreSQL to run this simple query on a 1 TB TPC-DS database:
select cp_start_date_sk, ws_sold_date_sk
from catalog_page, web_sales
where ws_sold_date_sk = cp_start_date_sk;
the query plan is:
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=34981958.72..381924075.14 rows=24611155542 width=8)
Hash Cond: (catalog_page.cp_start_date_sk = web_sales.ws_sold_date_sk)
-> Seq Scan on catalog_page (cost=0.00..1836.00 rows=60000 width=4)
-> Hash (cost=25981508.32..25981508.32 rows=720036032 width=4)
-> Seq Scan on web_sales (cost=0.00..25981508.32 rows=720036032 width=4)
As can be seen, the big table is used to build the hash. Supposedly this is not optimal: the hash table built from web_sales will be much larger, so building and probing it should take longer than building the hash from the small catalog_page table. Can anyone explain why the planner chose this?
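For reference, a quick sanity check to compare the two join inputs (a minimal sketch; reltuples is only the planner's row estimate, not an exact count):

-- compare planner row estimates and on-disk size of both join inputs
select relname,
       reltuples::bigint as estimated_rows,
       pg_size_pretty(pg_relation_size(oid)) as table_size
from pg_class
where relname in ('catalog_page', 'web_sales');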
Update 2: PostgreSQL version: psql (PostgreSQL) 14beta1
Relevant config settings:
max_worker_processes = 1
max_parallel_workers = 1
max_parallel_workers_per_gather = 1
shared_buffers = 125GB
effective_cache_size = 250GB
work_mem = 125GB
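These can be confirmed from a psql session (a minimal check, assuming the settings above are active):

-- confirm the session actually sees the configured values
select name, setting, unit
from pg_settings
where name in ('work_mem', 'shared_buffers',
               'effective_cache_size', 'max_parallel_workers_per_gather');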

