In a project, we are using postgresql v12.11 (updating is sadly not an option at the moment).
Consider the following relations:
document (
id uuid primary key
)
page (
id uuid primary key,
document_id uuid not null references document
)
word (
id uuid primary key,
page_id uuid not null references page,
text text not null
)
I had to implement a feature to find words across given documents by a given similarity threshold quickly, and after some research chose to use trigram matching with a GIN index.
This are my indices:
CREATE INDEX word_page_id_idx ON word (page_id);
CREATE INDEX word_text_idx ON word USING GIN (text gin_trgm_ops);
CREATE INDEX page_document_id ON page (document_id);
This is the query I came up with:
SELECT w.*, word_similarity(:searchString, w.text) similarity
FROM document d
INNER JOIN page p ON p.document_id = d.id
INNER JOIN word w ON w.page_id = p.id
WHERE d.id IN (:documentIds)
AND :searchString <% w.text
ORDER BY similarity DESC, w.text;
It worked very good in the beginning, but now with ~20kk rows in the words table it starts to get slower, especially with multiple documents.
My knowledge about indices and the query planner is not deep enough to understand what exactly is going on here, or why the query is getting slower.
Here is what EXPLAIN ANALYZE gives me for a query with 18 document ids. I wonder why the plan considers 185700 word rows, although the given documents in total have only around 9k words associated with them, but where do I proceed from here? AI suggested to create a combined GIN index on (page_id, text) which does not work...
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sort (cost=11351.77..11351.79 rows=8 width=63) (actual time=3825.241..3825.247 rows=74 loops=1) |
| Sort Key: (word_similarity('05.06.2025'::text, w.text)) DESC, w.text |
| Sort Method: quicksort Memory: 35kB |
| -> Nested Loop (cost=248.89..11351.65 rows=8 width=63) (actual time=130.456..3825.106 rows=74 loops=1) |
| -> Nested Loop (cost=0.83..138.23 rows=45 width=16) (actual time=0.517..4.050 rows=28 loops=1) |
| -> Index Only Scan using document_pkey on document d (cost=0.41..31.98 rows=18 width=16) (actual time=0.010..0.217 rows=18 loops=1) |
| Index Cond: (id = ANY ('{470d35da-a65e-4eb5-aae1-1ce9334ff617,44f05ab8-94f7-4e90-9008-a224b0f2c458,c9b8161d-2975-4902-884f-11c658960ca0,9315c0ec-b460-4044-b4a3-79b77f40faea,b6e0a7ef-37c3-4540-aeb9-424567e3c51f,0b0adbbe-7d70-484d-a589-f5952dd9c4ae,5f51a47e-11cc-4bf4-a3f6-d1452262d82f,099dfc29-1803-4a87-8820-891697b26047,3e6a8d9c-a40b-4980-a49e-8285eee4dedc,4078b105-aff4-478d-97ae-2b785b1bdd08,30b7a7f6-b67a-4f0c-8cfb-e686a01b8e89,7265f939-7f47-4264-8a72-43f71312ba74,3e03b9b5-2c24-4d7a-b063-5bec34ad6e1e,23f69244-81a9-4dbc-984d-21bfd2bd0147,39837f1c-491e-4b43-9c86-713c7b16ec7a,5ba1dce8-97e0-4e22-8685-e018c4dbbf31,15e34009-1ed7-470a-9b16-34fc416595eb,9fb7604b-eafd-49be-b8cf-d9f2c5d663bf}'::uuid[]))|
| Heap Fetches: 0 |
| -> Index Scan using page_document_id on page p (cost=0.42..5.86 rows=4 width=32) (actual time=0.203..0.209 rows=2 loops=18) |
| Index Cond: (document_id = d.id) |
| -> Bitmap Heap Scan on word ow (cost=248.06..249.18 rows=1 width=59) (actual time=136.266..136.449 rows=3 loops=28) |
| Recheck Cond: (page_id = p.id) |
| Filter: ('05.06.2025'::text <% text) |
| Rows Removed by Filter: 3 |
| Heap Blocks: exact=50 |
| -> BitmapAnd (cost=248.06..248.06 rows=1 width=0) (actual time=135.254..135.254 rows=0 loops=28) |
| -> Bitmap Index Scan on word_page_id_idx (cost=0.00..9.16 rows=761 width=0) (actual time=0.021..0.021 rows=251 loops=28) |
| Index Cond: (page_id = p.id) |
| -> Bitmap Index Scan on word_text_idx (cost=0.00..233.26 rows=21568 width=0) (actual time=135.229..135.229 rows=185700 loops=28) |
| Index Cond: (text %> '05.06.2025'::text) |
|Planning Time: 26.187 ms |
|Execution Time: 3825.320 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
05.06.2025, so that is bound to be slow. Dropping the trigram index would improve the query performance...