Performance of JOIN on columns which have different type in each table

Question

I need to join 2 tables in Postgres on multiple columns. The columns have a different type in each table.

This is not a question how to join the tables (this has been asked and answered several times), but how to join them efficiently, with full usage of the indexes. The tables have millions of records.

If I am to design a new table with columns always having int value, but on which I need to JOIN to another already existing old table with text values, what would lead to better performance:

Designing the columns as int and sacrificing the performance of the JOIN, or
Designing the columns as text to avoid casting and have the JOIN "direct", and sacrificing the size of the columns and the performance of the indexes (I assume that an index on int is much more efficient than an index on text).

CREATE TABLE table_new
(
    id1 int NOT NULL,
    id2 int NOT NULL,
    -- other columns
);
CREATE INDEX idx_new_id ON table_new (id1, id2);

CREATE TABLE table_old
(
    id1 text NOT NULL,
    id2 text NOT NULL,
    -- other columns
);
CREATE INDEX idx_old_id ON table_old (id1, id2);

I need to join table_old and table_new on id1 and id2. I know that I can use various forms of cast (as described e.g. Join two tables in postgresql if type of columns is different).

What I effectively need to achieve is an equivalent of

SELECT * FROM table_old
    JOIN table_new 
    ON table_old.id1 = table_new.id1 AND table_old.id2 = table_new.id2;

However, how to do it so it uses the indexes, or how to create the indexes that the join is efficient?

I am worried that any form of casting id1 and id2 between int and text will destroy the performance.

To avoid the XY problem, here is the wider context:

table_old is a legacy table which I am not allowed to change.
table_new is my new table which I am designing.
Both the tables can have millions of records.
I know that in table_new the columns id1 and id2 always contain int value, so I wanted to make the table and its indexes as small and as fast as possible, hence I designed them as int.
If there is no reasonable way to effectively JOIN the tables with different types of column, I can also accept as a solution that the columns would be text.
What I really need to be fast is the JOIN.

Have you considered to add text copies of id1 and id2 as generated columns_ to table_new. Create index and use these when joining table_old. postgresql.org/docs/current/ddl-generated-columns.html — jarlh
– jarlh, Commented Sep 2, 2024 at 18:30
@jarlh Index on text generated columns will have the same size and efficiency as index on actual text column. — Honza Zidek
– Honza Zidek, Commented Sep 2, 2024 at 18:59
That's impossible to answer, because there are different join strategies in PostgreSQL. The hash join, for example, doesn't benefit from indexes at all. So if your work_mem is big enough to make a hash join feasible, you don't need an index. And a nested loop join might need other indexes than a merge join. So it is impossible to give a definite answer without some experimentation. — Laurenz Albe
– Laurenz Albe, Commented Sep 3, 2024 at 2:20
I would [personally] clean up the data and convert to INT as an initial step. Who knows if those TEXT columns have garbage non-int values in there, that won't join with anything ever. — The Impaler
– The Impaler, Commented Sep 3, 2024 at 12:46

jarlh · Accepted Answer · 2024-09-02 18:31:56Z

4

Functional indexes to the rescue!

CREATE TEMP TABLE t_old (tid text PRIMARY KEY);
CREATE TEMP TABLE t_new (iid int NOT NULL);
CREATE INDEX t_old_int_idx ON t_old ((tid::int));
CREATE INDEX t_new_int_idx ON t_new (iid);
INSERT INTO t_old SELECT i::text FROM generate_series(1,99999) i;
INSERT INTO t_new VALUES (10),(20),(30);
ANALYSE t_old;
ANALYSE t_new;
EXPLAIN ANALYSE SELECT t_new.*, t_old.* FROM t_new JOIN t_old ON t_new.iid = t_old.tid::int;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                            QUERY PLAN                                                             │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Merge Join  (cost=1.69..2.51 rows=3 width=9) (actual time=0.063..0.082 rows=3 loops=1)                                            │
│   Merge Cond: ((t_old.tid)::integer = t_new.iid)                                                                                  │
│   ->  Index Scan using t_old_int_idx on t_old  (cost=0.29..3050.28 rows=99999 width=5) (actual time=0.016..0.028 rows=31 loops=1) │
│   ->  Sort  (cost=1.05..1.06 rows=3 width=4) (actual time=0.021..0.022 rows=3 loops=1)                                            │
│         Sort Key: t_new.iid                                                                                                       │
│         Sort Method: quicksort  Memory: 25kB                                                                                      │
│         ->  Seq Scan on t_new  (cost=0.00..1.03 rows=3 width=4) (actual time=0.009..0.010 rows=3 loops=1)                         │
│ Planning Time: 0.485 ms                                                                                                           │
│ Execution Time: 0.127 ms                                                                                                          │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

If not all the values in your old table are in fact valid integers you might need to make your functional index partial too.

edited Sep 2, 2024 at 18:31

jarlh

44.9k8 gold badges52 silver badges68 bronze badges

answered Sep 2, 2024 at 17:52

Richard Huxton

23.6k5 gold badges43 silver badges54 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Honza Zidek Over a year ago

What will happen in CREATE INDEX t_old_int_idx ON t_old ((tid::int)); if t_old.tid is not int or is NULL?

Laurenz Albe Over a year ago

Then CREATE INDEX will get an error.

Zegarek Over a year ago

If t_old.tid isn't a valid int, you will need to make that a partial index, as already pointed out in the answer. In pg16 and above you can add a where pg_input_is_valid(tid,'integer'). The function isn't available in v15 and below but you can easily add your own.

The Impaler Over a year ago

I think the strategy is sound but I would cast towards TEXT rather than INT for safety. Also the strategy should account for nulls. Apart from this it should perform well.

Collectives™ on Stack Overflow

Performance of JOIN on columns which have different type in each table

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related