I need to join 2 tables in Postgres on multiple columns. The columns have a different type in each table.
This is not a question of how to join the tables (that has been asked and answered several times), but of how to join them efficiently, with full use of the indexes. The tables have millions of records.
If I design a new table whose columns always hold int values, but which I need to JOIN to an already existing old table that stores the same values as text, what would lead to better performance:
- Designing the columns as int and sacrificing the performance of the JOIN, or
- Designing the columns as text to avoid casting and have the JOIN "direct", and sacrificing the size of the columns and the performance of the indexes (I assume that an index on int is much more efficient than an index on text).
CREATE TABLE table_new
(
id1 int NOT NULL,
id2 int NOT NULL,
-- other columns
);
CREATE INDEX idx_new_id ON table_new (id1, id2);
CREATE TABLE table_old
(
id1 text NOT NULL,
id2 text NOT NULL,
-- other columns
);
CREATE INDEX idx_old_id ON table_old (id1, id2);
I need to join table_old and table_new on id1 and id2. I know that I can use various forms of cast (as described, e.g., in "Join two tables in postgresql if type of columns is different").
What I effectively need to achieve is an equivalent of
SELECT * FROM table_old
JOIN table_new
ON table_old.id1 = table_new.id1 AND table_old.id2 = table_new.id2;
However, how can I do it so that it uses the indexes, or how should I create the indexes so that the join is efficient?
I am worried that any form of casting id1 and id2 between int and text will destroy the performance.
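For illustration, the two explicit-cast variants of that join could look like the sketch below. It assumes the table definitions shown earlier; which variant the planner handles better is not known in advance and would have to be checked with EXPLAIN. Note that the two variants are not strictly equivalent: a text value such as '007' matches the int 7 only in the second one.

```sql
-- Variant A: cast the new table's int columns to text.
-- The comparison is text = text, so idx_old_id on table_old (id1, id2)
-- can still be used for lookups into table_old.
SELECT *
FROM table_old
JOIN table_new
  ON table_old.id1 = table_new.id1::text
 AND table_old.id2 = table_new.id2::text;

-- Variant B: cast the legacy text columns to int.
-- The comparison is int = int, so idx_new_id on table_new (id1, id2)
-- can still be used for lookups into table_new.
-- This raises an error if any table_old value is not a valid integer.
SELECT *
FROM table_old
JOIN table_new
  ON table_old.id1::int = table_new.id1
 AND table_old.id2::int = table_new.id2;
```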
To avoid the XY problem, here is the wider context:
- table_old is a legacy table which I am not allowed to change.
- table_new is my new table which I am designing.
- Both the tables can have millions of records.
- I know that in table_new the columns id1 and id2 always contain int value, so I wanted to make the table and its indexes as small and as fast as possible, hence I designed them as int.
- If there is no reasonable way to effectively JOIN the tables with different types of column, I can also accept as a solution that the columns would be text.
- What I really need to be fast is the JOIN.
- An index on text generated columns will have the same size and efficiency as an index on an actual text column.
- If work_mem is big enough to make a hash join feasible, you don't need an index. And a nested loop join might need other indexes than a merge join. So it is impossible to give a definite answer without some experimentation.
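One way to run such an experiment is sketched below. It assumes PostgreSQL 12 or later (for stored generated columns). The column names id1_text and id2_text, the index name idx_new_id_text, and the work_mem value are made up for illustration and are not a recommendation.

```sql
-- Keep the int columns and add text copies as stored generated columns.
-- id1_text / id2_text are hypothetical names.
CREATE TABLE table_new
(
    id1 int NOT NULL,
    id2 int NOT NULL,
    id1_text text GENERATED ALWAYS AS (id1::text) STORED,
    id2_text text GENERATED ALWAYS AS (id2::text) STORED
    -- other columns
);

-- Index on the generated text columns, so the join can compare text with text.
CREATE INDEX idx_new_id_text ON table_new (id1_text, id2_text);

-- Arbitrary value for the test session; adjust to the available memory.
SET work_mem = '256MB';

-- Show the plan actually chosen (hash, merge, or nested loop join),
-- together with real timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM table_old
JOIN table_new
  ON table_old.id1 = table_new.id1_text
 AND table_old.id2 = table_new.id2_text;
```

Repeating the EXPLAIN with different work_mem settings, and with or without the extra index, should show which join strategy the planner picks for the real data volumes. The join only finds a match when table_old stores the numbers in the same textual form that the cast produces (for example '7', not '007').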