How to improve word_similarity query performance in postgresql?

Question

In a project, we are using postgresql v12.11 (updating is sadly not an option at the moment).

Consider the following relations:

document (
  id uuid primary key
)

page (
  id uuid primary key,
  document_id uuid not null references document
)

word (
  id uuid primary key,
  page_id uuid not null references page,
  text text not null
)

I had to implement a feature to find words across given documents by a given similarity threshold quickly, and after some research chose to use trigram matching with a GIN index.

This are my indices:

CREATE INDEX word_page_id_idx ON word (page_id);
CREATE INDEX word_text_idx ON word USING GIN (text gin_trgm_ops);
CREATE INDEX page_document_id ON page (document_id);

This is the query I came up with:

SELECT w.*, word_similarity(:searchString, w.text) similarity
    FROM document d
             INNER JOIN page p ON p.document_id = d.id
             INNER JOIN word w ON w.page_id = p.id
    WHERE d.id IN (:documentIds)
    AND :searchString <% w.text
    ORDER BY similarity DESC, w.text;

It worked very good in the beginning, but now with ~20kk rows in the words table it starts to get slower, especially with multiple documents.

My knowledge about indices and the query planner is not deep enough to understand what exactly is going on here, or why the query is getting slower.

Here is what EXPLAIN ANALYZE gives me for a query with 18 document ids. I wonder why the plan considers 185700 word rows, although the given documents in total have only around 9k words associated with them, but where do I proceed from here? AI suggested to create a combined GIN index on (page_id, text) which does not work...

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sort  (cost=11351.77..11351.79 rows=8 width=63) (actual time=3825.241..3825.247 rows=74 loops=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|  Sort Key: (word_similarity('05.06.2025'::text, w.text)) DESC, w.text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|  Sort Method: quicksort  Memory: 35kB                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|  ->  Nested Loop  (cost=248.89..11351.65 rows=8 width=63) (actual time=130.456..3825.106 rows=74 loops=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|        ->  Nested Loop  (cost=0.83..138.23 rows=45 width=16) (actual time=0.517..4.050 rows=28 loops=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|              ->  Index Only Scan using document_pkey on document d  (cost=0.41..31.98 rows=18 width=16) (actual time=0.010..0.217 rows=18 loops=1)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                    Index Cond: (id = ANY ('{470d35da-a65e-4eb5-aae1-1ce9334ff617,44f05ab8-94f7-4e90-9008-a224b0f2c458,c9b8161d-2975-4902-884f-11c658960ca0,9315c0ec-b460-4044-b4a3-79b77f40faea,b6e0a7ef-37c3-4540-aeb9-424567e3c51f,0b0adbbe-7d70-484d-a589-f5952dd9c4ae,5f51a47e-11cc-4bf4-a3f6-d1452262d82f,099dfc29-1803-4a87-8820-891697b26047,3e6a8d9c-a40b-4980-a49e-8285eee4dedc,4078b105-aff4-478d-97ae-2b785b1bdd08,30b7a7f6-b67a-4f0c-8cfb-e686a01b8e89,7265f939-7f47-4264-8a72-43f71312ba74,3e03b9b5-2c24-4d7a-b063-5bec34ad6e1e,23f69244-81a9-4dbc-984d-21bfd2bd0147,39837f1c-491e-4b43-9c86-713c7b16ec7a,5ba1dce8-97e0-4e22-8685-e018c4dbbf31,15e34009-1ed7-470a-9b16-34fc416595eb,9fb7604b-eafd-49be-b8cf-d9f2c5d663bf}'::uuid[]))|
|                    Heap Fetches: 0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|              ->  Index Scan using page_document_id on page p  (cost=0.42..5.86 rows=4 width=32) (actual time=0.203..0.209 rows=2 loops=18)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                    Index Cond: (document_id = d.id)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|        ->  Bitmap Heap Scan on word ow  (cost=248.06..249.18 rows=1 width=59) (actual time=136.266..136.449 rows=3 loops=28)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|              Recheck Cond: (page_id = p.id)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|              Filter: ('05.06.2025'::text <% text)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|              Rows Removed by Filter: 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|              Heap Blocks: exact=50                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|              ->  BitmapAnd  (cost=248.06..248.06 rows=1 width=0) (actual time=135.254..135.254 rows=0 loops=28)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                    ->  Bitmap Index Scan on word_page_id_idx  (cost=0.00..9.16 rows=761 width=0) (actual time=0.021..0.021 rows=251 loops=28)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|                          Index Cond: (page_id = p.id)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|                    ->  Bitmap Index Scan on word_text_idx  (cost=0.00..233.26 rows=21568 width=0) (actual time=135.229..135.229 rows=185700 loops=28)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|                          Index Cond: (text %> '05.06.2025'::text)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|Planning Time: 26.187 ms                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|Execution Time: 3825.320 ms                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This query plan formatting is broken, there must be tabs at plan nodes... — Rabban Keyak
– Rabban Keyak, Commented Sep 24 at 14:29
Hm I copied it from datagrip with "markdown" copy option... i check later, maybe I'll just append a screenshot then — T_01
– T_01, Commented Sep 24 at 14:51
There are 185700 rows that contain all the trigrams in 05.06.2025, so that is bound to be slow. Dropping the trigram index would improve the query performance... — Laurenz Albe
– Laurenz Albe, Commented Sep 24 at 15:27
As an analogy, the index at the back of a book helps you to find pages in that book where a given word appears. Why doesn't the index include common words like "the?" Because it would list every single page in the book, and flipping back and forth between that list and every respective page is less efficient than simply reading the book cover to cover. — Bill Karwin
– Bill Karwin, Commented Sep 24 at 18:19

Zegarek · Accepted Answer · 2025-09-25 16:28:48Z

If you're interested in top-N searches, use a GiST index, not GIN. F.33.4. Index Support:

This will return all values in the text column that are sufficiently similar to word, sorted from best match to worst. The index will be used to make this a fast operation even over very large data sets.
A variant of the above query is (...)
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

You can also enable pg_trgm to work in tandem with btree_gist to combine your indexes:
_{demo at db<>fiddle}

DROP INDEX IF EXISTS word_text_idx;
DROP INDEX IF EXISTS word_page_id_idx;
CREATE INDEX word_text_pageid_idx ON word 
  USING GiST (text gist_trgm_ops, page_id)--worth testing with flipped order
  INCLUDE(id)--enables index-only scan at the cost of larger index footprint
  WITH(fillfactor=100);--only if documents are static or batch-loaded

This should simplify the bottom/innermost part of your plan. Analogue btree_gin is also available, if you'd like to test the same with a GIN index.

I don't think you need to involve document, at all. The document.id is already present in page.document_id which you need to scan anyways, and you're not selecting anything else from document. If you're really only interested in the contents of word, cut out document:

SELECT w.*, :searchString <<-> w.text AS word_difference
FROM page p
JOIN word w ON w.page_id = p.id
WHERE p.document_id IN (:documentIds)
AND :searchString <% w.text
ORDER BY word_difference ASC, w.text
LIMIT 20;--if you don't really need all 74

An important note is that trigrams only include [:alnum:] class characters (whitespace and punctuation are ignored), so the . dots in '05.06.2025' only contribute to trigram boundaries but aren't included themselves. With default pg_trgm.word_similarity_threshold of 0.6, you might be catching surprising matches:

v	word_similarity	t<<->v	is caught by t <% v
06.2025.05	1	0	t
05.06.2025	1	0	t
06.05.2025	1	0	t
05.2025.06	1	0	t
025.20.05.06	0.8181818	0.18181819	t
07.06.2025	0.8	0.19999999	t
00.2025.06	0.8	0.19999999	t
05.2025	0.8	0.19999999	t
06.07.2025	0.6666667	0.3333333	t
25.06.2005	0.6	0.39999998	t
25.05.2006	0.6	0.39999998	t

As already pointed out by the commenters, none of this will help much unless your search is selective enough.

If you're looking for a date, you can filter your word table based on pg_input_isvalid("text",'date') which lets you safely cast "text"::date on the subset, then select

documents mentioning the specific date, not just something looking like one
N closest dates to the target June 5th 2025
only search within a specific date(multi)range.

Same applies to any data type: you could search documents for an IP address in a given network, a timestamp between specific ::times of day, date on a given weekday, long enough interval, a numeric of a desired scale.

Worth mentioning, but to make it fast, you'd have to extract and index data for each target type, at which point you're stepping into semantic analysis and might as well add pgvector to the mix.

Collectives™ on Stack Overflow

How to improve word_similarity query performance in postgresql?

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related