1

I need to join 2 tables in Postgres on multiple columns. The columns have a different type in each table.

This is not a question how to join the tables (this has been asked and answered several times), but how to join them efficiently, with full usage of the indexes. The tables have millions of records.

If I am to design a new table with columns always having int value, but on which I need to JOIN to another already existing old table with text values, what would lead to better performance:

  1. Designing the columns as int and sacrificing the performance of the JOIN, or
  2. Designing the columns as text to avoid casting and have the JOIN "direct", and sacrificing the size of the columns and the performance of the indexes (I assume that an index on int is much more efficient than an index on text).
CREATE TABLE table_new
(
    id1 int NOT NULL,
    id2 int NOT NULL,
    -- other columns
);
CREATE INDEX idx_new_id ON table_new (id1, id2);

CREATE TABLE table_old
(
    id1 text NOT NULL,
    id2 text NOT NULL,
    -- other columns
);
CREATE INDEX idx_old_id ON table_old (id1, id2);

I need to join table_old and table_new on id1 and id2. I know that I can use various forms of cast (as described e.g. Join two tables in postgresql if type of columns is different).

What I effectively need to achieve is an equivalent of

SELECT * FROM table_old
    JOIN table_new 
    ON table_old.id1 = table_new.id1 AND table_old.id2 = table_new.id2;

However, how to do it so it uses the indexes, or how to create the indexes that the join is efficient?

I am worried that any form of casting id1 and id2 between int and text will destroy the performance.

To avoid the XY problem, here is the wider context:

  1. table_old is a legacy table which I am not allowed to change.
  2. table_new is my new table which I am designing.
  3. Both the tables can have millions of records.
  4. I know that in table_new the columns id1 and id2 always contain int value, so I wanted to make the table and its indexes as small and as fast as possible, hence I designed them as int.
  5. If there is no reasonable way to effectively JOIN the tables with different types of column, I can also accept as a solution that the columns would be text.
  6. What I really need to be fast is the JOIN.
6
  • 1
    Have you considered to add text copies of id1 and id2 as generated columns_ to table_new. Create index and use these when joining table_old. postgresql.org/docs/current/ddl-generated-columns.html Commented Sep 2, 2024 at 18:30
  • @jarlh Index on text generated columns will have the same size and efficiency as index on actual text column. Commented Sep 2, 2024 at 18:59
  • ericlippert.com/2012/12/17/performance-rant Commented Sep 2, 2024 at 19:14
  • That's impossible to answer, because there are different join strategies in PostgreSQL. The hash join, for example, doesn't benefit from indexes at all. So if your work_mem is big enough to make a hash join feasible, you don't need an index. And a nested loop join might need other indexes than a merge join. So it is impossible to give a definite answer without some experimentation. Commented Sep 3, 2024 at 2:20
  • 2
    I would [personally] clean up the data and convert to INT as an initial step. Who knows if those TEXT columns have garbage non-int values in there, that won't join with anything ever. Commented Sep 3, 2024 at 12:46

1 Answer 1

4

Functional indexes to the rescue!

CREATE TEMP TABLE t_old (tid text PRIMARY KEY);
CREATE TEMP TABLE t_new (iid int NOT NULL);
CREATE INDEX t_old_int_idx ON t_old ((tid::int));
CREATE INDEX t_new_int_idx ON t_new (iid);
INSERT INTO t_old SELECT i::text FROM generate_series(1,99999) i;
INSERT INTO t_new VALUES (10),(20),(30);
ANALYSE t_old;
ANALYSE t_new;
EXPLAIN ANALYSE SELECT t_new.*, t_old.* FROM t_new JOIN t_old ON t_new.iid = t_old.tid::int;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                            QUERY PLAN                                                             │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Merge Join  (cost=1.69..2.51 rows=3 width=9) (actual time=0.063..0.082 rows=3 loops=1)                                            │
│   Merge Cond: ((t_old.tid)::integer = t_new.iid)                                                                                  │
│   ->  Index Scan using t_old_int_idx on t_old  (cost=0.29..3050.28 rows=99999 width=5) (actual time=0.016..0.028 rows=31 loops=1) │
│   ->  Sort  (cost=1.05..1.06 rows=3 width=4) (actual time=0.021..0.022 rows=3 loops=1)                                            │
│         Sort Key: t_new.iid                                                                                                       │
│         Sort Method: quicksort  Memory: 25kB                                                                                      │
│         ->  Seq Scan on t_new  (cost=0.00..1.03 rows=3 width=4) (actual time=0.009..0.010 rows=3 loops=1)                         │
│ Planning Time: 0.485 ms                                                                                                           │
│ Execution Time: 0.127 ms                                                                                                          │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

If not all the values in your old table are in fact valid integers you might need to make your functional index partial too.

Sign up to request clarification or add additional context in comments.

4 Comments

What will happen in CREATE INDEX t_old_int_idx ON t_old ((tid::int)); if t_old.tid is not int or is NULL?
Then CREATE INDEX will get an error.
If t_old.tid isn't a valid int, you will need to make that a partial index, as already pointed out in the answer. In pg16 and above you can add a where pg_input_is_valid(tid,'integer'). The function isn't available in v15 and below but you can easily add your own.
I think the strategy is sound but I would cast towards TEXT rather than INT for safety. Also the strategy should account for nulls. Apart from this it should perform well.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.