
I'm using Postgres to run this simple query on a 1 TB TPC-DS database:

select cp_start_date_sk, ws_sold_date_sk
from catalog_page, web_sales
where ws_sold_date_sk = cp_start_date_sk;

the query plan is:

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=34981958.72..381924075.14 rows=24611155542 width=8)
   Hash Cond: (catalog_page.cp_start_date_sk = web_sales.ws_sold_date_sk)
   ->  Seq Scan on catalog_page  (cost=0.00..1836.00 rows=60000 width=4)
   ->  Hash  (cost=25981508.32..25981508.32 rows=720036032 width=4)
         ->  Seq Scan on web_sales  (cost=0.00..25981508.32 rows=720036032 width=4)

As can be seen, the big table is used to build the hash. Allegedly this is not optimal, because the hash table will be bigger and building it will take longer than building the hash from the small table. Can anyone explain this?
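One way to probe the planner's choice is to disable hash joins for the session and compare the cost of its next-best plan (a sketch using standard planner toggles; `enable_hashjoin` only discourages the strategy, it doesn't change which side is hashed):

```sql
-- Baseline plan (hash join, big table on the build side):
EXPLAIN
SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page, web_sales
WHERE ws_sold_date_sk = cp_start_date_sk;

-- See what the planner falls back to, and at what estimated cost:
SET enable_hashjoin = off;  -- session-local planner toggle
EXPLAIN
SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page, web_sales
WHERE ws_sold_date_sk = cp_start_date_sk;
RESET enable_hashjoin;
```

If the alternative plan's estimated cost is higher, the planner considered both and genuinely preferred the plan shown above.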

update: the schema: [table definition screenshots not reproduced here]

update 2: PostgreSQL version: psql (PostgreSQL) 14beta1

config definitions:

max_worker_processes = 1

max_parallel_workers = 1

max_parallel_workers_per_gather = 1

shared_buffers = 125GB

effective_cache_size = 250GB

work_mem = 125GB
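For reference, the effective values can be confirmed from a running psql session with the standard `SHOW` command (values below are what the config above should yield):

```sql
SHOW max_parallel_workers_per_gather;
SHOW shared_buffers;
SHOW work_mem;  -- note: work_mem is a per-hash/per-sort limit, not a global pool
```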
  • can you add the create table + create index statements? The indexes aren't being used at all ... Commented Jul 12, 2021 at 14:35
  • Nothing is "inserted into" - the hash table used for the join is built up in memory. If there isn't enough memory to do that, Postgres will switch to a different join strategy. Commented Jul 12, 2021 at 15:05
  • Totally unrelated, but: you might want to start using the "modern" explicit JOIN operator that was introduced in SQL nearly 30 years ago, rather than the implicit and fragile join conditions in the WHERE clause Commented Jul 12, 2021 at 15:06
  • You need to show us the table and index definitions, as well as row counts for each of the tables. Maybe your tables are defined poorly. Maybe the indexes aren't created correctly. Maybe you don't have an index on that column you thought you did. Without seeing the table and index definitions, we can't tell. We need row counts because that can affect query planning. If you know how to do an EXPLAIN or get an execution plan, put the results in the question as well. If you have no indexes, visit use-the-index-luke.com. Commented Jul 12, 2021 at 15:08
  • I created the tables with the "tpcds.sql" file and the constraints with the "tpcds_ri.sql" file that come with the TPC-DS kit here: github.com/gregrahn/tpcds-kit/tree/master/tools. The schema of the tables is in the updated question. Commented Jul 12, 2021 at 15:39

1 Answer


There are no indexes on the columns used in your query. Add the following indexes and the query time should improve (check the query plan again):

CREATE INDEX idx_cat ON catalog_page (cp_start_date_sk);
CREATE INDEX idx_ws ON web_sales (ws_sold_date_sk);

Also, try to use proper JOINs instead of listing multiple tables in the WHERE clause - the latter can give you a huge headache if you accidentally cross join multiple large tables.

SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page c 
JOIN web_sales w ON w.ws_sold_date_sk = c.cp_start_date_sk;
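After creating the indexes, it's worth refreshing the planner's statistics before re-checking the plan (standard Postgres commands; note that `EXPLAIN ANALYZE` actually executes the query, so on 1 TB it will take a while):

```sql
-- Refresh statistics so the planner sees the new indexes and current row counts:
ANALYZE catalog_page;
ANALYZE web_sales;

-- Re-check the plan; EXPLAIN alone is cheap, EXPLAIN ANALYZE runs the query:
EXPLAIN
SELECT cp_start_date_sk, ws_sold_date_sk
FROM catalog_page c
JOIN web_sales w ON w.ws_sold_date_sk = c.cp_start_date_sk;
```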

Demo: db<>fiddle


2 Comments

Actually, I checked the explicit join, and even swapped the order of the tables, and always got the same plan. I didn't try indexes because I'm not trying to fix the problem, but to understand why Postgres produces such an inefficient plan.
In fact, a supporting index is needed for all FK fields (otherwise, updates or deletes in the referenced table would be very costly). Plus: run ANALYZE on the table after creating the index.
