Why is postgres trigram word_similarity function not using a gin index?

Question

The postgres trigram documentation states:

The pg_trgm module provides GiST and GIN index operator classes that allow you to create an index over a text column for the purpose of very fast similarity searches. These index types support the above-described similarity operators, and additionally support trigram-based index searches for LIKE, ILIKE, ~ and ~* queries.

and shows the following example:

SELECT t, word_similarity('word', t) AS sml
  FROM test_trgm
  WHERE 'word' <% t
  ORDER BY sml DESC, t;

Awesome!

However, when running the following query:

SELECT * 
FROM place 
WHERE word_similarity(place.name, '__SOME_STRING__') > 0.5

The index that was created is not being used.

However, when using ILIKE or the %> operators, it does seem that the index is being used. Why is the index not used on the word_similarity function?

Ulad Kasach · Accepted Answer · 2020-02-11 21:12:09Z

According to this postgres forum response

PostgreSQL doesn't use index scan with functions within WHERE clause. So you always need to use operators instead. You can try <% operator and pg_trgm.word_similarity_threshold variable:

=# SET pg_trgm.word_similarity_threshold TO 0.1;

=# SELECT name, popularity FROM temp.items3_v ,(values ('some phrase'::text)) consts(input) WHERE input <% name ORDER BY 2, input <<-> name;

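So, the query can be updated to use the index as follows. The boolean operator <% goes in the WHERE clause, and the distance operator <<-> is only used for ordering:

SET pg_trgm.word_similarity_threshold TO 0.1;
SELECT * 
FROM place 
WHERE '__SOME_STRING__' <% place.name
ORDER BY '__SOME_STRING__' <<-> place.name;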

Warning: the index is used with only one member of the commutator pair. In my case it was used with <<-> and not with <->>. This Stack Overflow Q&A post gives a reasonable explanation as to why:

These are different operations, and only one of them is supported by the index.
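
One way to check which form the planner can match to a trigram index is to run both through EXPLAIN. This is a minimal sketch, assuming the place table from the question with a text column name. The index name place_name_trgm_idx is made up for the example, and on a small table the planner may still prefer a sequential scan:

```sql
-- Assumption: pg_trgm is installable and place(name text) exists.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX IF NOT EXISTS place_name_trgm_idx  -- hypothetical index name
    ON place USING GIN (name gin_trgm_ops);

-- Function call in WHERE: the planner has no operator to match to the index.
EXPLAIN
SELECT * FROM place
WHERE word_similarity('__SOME_STRING__', name) > 0.5;

-- Operator form: the threshold moves into a setting and the predicate
-- becomes "constant <% column", as in the documentation example.
SET pg_trgm.word_similarity_threshold TO 0.5;
EXPLAIN
SELECT * FROM place
WHERE '__SOME_STRING__' <% name;
```

Note the argument order: '__SOME_STRING__' <% name corresponds to word_similarity('__SOME_STRING__', name), with the search string first, which is the reverse of the call in the question.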

amirouche · Accepted Answer · 2024-07-18 15:43:56Z

With PostgreSQL 14, the similarity operator to use is %, because a WHERE clause only accepts boolean expressions.

The operators <<-> and <->> do not work on their own in a WHERE clause. They return a real number (a distance), not a boolean, so PostgreSQL complains with the following message:

argument of WHERE must be type boolean, not type real
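
To illustrate, here is a sketch that assumes the entities table used later in this answer. A distance operator only becomes a valid WHERE condition once it is compared with a number, and its usual place is ORDER BY. The 0.25 cutoff is an arbitrary value chosen for the example:

```sql
-- Fails: <<-> returns a real (a distance), not a boolean.
-- SELECT surface FROM entities WHERE 'Victor Hugo' <<-> surface;

-- Valid: comparing the distance with a number yields a boolean,
-- but it is a plain expression, so do not expect the index to be used.
SELECT surface
FROM entities
WHERE ('Victor Hugo' <<-> surface) < 0.25;

-- Boolean operator for filtering, distance operator for ordering.
SELECT surface, 'Victor Hugo' <<-> surface AS dist
FROM entities
WHERE 'Victor Hugo' <% surface
ORDER BY dist
LIMIT 100;
```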

So, instead of:

> EXPLAIN ANALYZE
  SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score 
  FROM entities 
  ORDER BY score DESC
  LIMIT 100;
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                        |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=488812.86..488824.53 rows=100 width=60) (actual time=10917.954..10922.771 rows=100 loops=1)                                          |
|   ->  Gather Merge  (cost=488812.86..1969466.87 rows=12690434 width=60) (actual time=10915.749..10920.559 rows=100 loops=1)                       |
|         Workers Planned: 2                                                                                                                        |
|         Workers Launched: 2                                                                                                                       |
|         ->  Sort  (cost=487812.84..503675.88 rows=6345217 width=60) (actual time=10899.929..10899.932 rows=75 loops=3)                            |
|               Sort Key: (similarity(surface, 'Victor Hugo'::text)) DESC                                                                           |
|               Sort Method: top-N heapsort  Memory: 48kB                                                                                           |
|               Worker 0:  Sort Method: top-N heapsort  Memory: 39kB                                                                                |
|               Worker 1:  Sort Method: top-N heapsort  Memory: 40kB                                                                                |
|               ->  Parallel Seq Scan on entities  (cost=0.00..245303.21 rows=6345217 width=60) (actual time=1.592..10271.425 rows=5075357 loops=3) |
| Planning Time: 0.105 ms                                                                                                                           |
| JIT:                                                                                                                                              |
|   Functions: 7                                                                                                                                    |
|   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                   |
|   Timing: Generation 0.513 ms, Inlining 0.000 ms, Optimization 0.455 ms, Emission 6.380 ms, Total 7.349 ms                                        |
| Execution Time: 10923.139 ms                                                                                                                      |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

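Without a WHERE clause there is nothing for the index to match, so the plan is a parallel sequential scan over the whole table and takes about 10.9 seconds.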

To use the GIN or GiST index and speed up the query, you need both a threshold and a WHERE clause that uses the % operator. Note that % reads pg_trgm.similarity_threshold; pg_trgm.word_similarity_threshold only affects <% and %>:

SET pg_trgm.similarity_threshold TO 0.75;
SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score 
FROM entities 
WHERE surface % 'Victor Hugo'
ORDER BY score DESC 
LIMIT 100;

Here is the query plan, which uses the index called trgm_idx:

> EXPLAIN ANALYZE
SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score 
FROM entities 

WHERE surface % 'Victor Hugo'

ORDER BY score DESC
LIMIT 100;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                |
|-------------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=5798.57..5798.82 rows=100 width=60) (actual time=2338.179..2338.189 rows=100 loops=1)                                        |
|   ->  Sort  (cost=5798.57..5802.38 rows=1522 width=60) (actual time=2338.178..2338.182 rows=100 loops=1)                                  |
|         Sort Key: (similarity(surface, 'Victor Hugo'::text)) DESC                                                                         |
|         Sort Method: top-N heapsort  Memory: 48kB                                                                                         |
|         ->  Bitmap Heap Scan on entities  (cost=88.21..5740.40 rows=1522 width=60) (actual time=2287.892..2336.556 rows=13828 loops=1)    |
|               Recheck Cond: (surface % 'Victor Hugo'::text)                                                                               |
|               Heap Blocks: exact=9422                                                                                                     |
|               ->  Bitmap Index Scan on trgm_idx  (cost=0.00..87.83 rows=1522 width=0) (actual time=2286.962..2286.963 rows=13828 loops=1) |
|                     Index Cond: (surface % 'Victor Hugo'::text)                                                                           |
| Planning Time: 0.679 ms                                                                                                                   |
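| Execution Time: 2338.234 ms                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------+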
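
Since % and <% each read their own threshold setting, it can help to inspect both before running the query. The following is a sketch reusing the entities table from above. The 0.75 values are only illustrative, and the second query is a word-similarity variant that is closer to the original question:

```sql
-- Each operator reads its own setting.
SHOW pg_trgm.similarity_threshold;       -- used by %
SHOW pg_trgm.word_similarity_threshold;  -- used by <% and %>

-- Whole-string similarity, filtered by %.
SET pg_trgm.similarity_threshold TO 0.75;
SELECT surface, similarity(surface, 'Victor Hugo') AS score
FROM entities
WHERE surface % 'Victor Hugo'
ORDER BY score DESC
LIMIT 100;

-- Word similarity, filtered by <% (search string on the left).
SET pg_trgm.word_similarity_threshold TO 0.75;
SELECT surface, word_similarity('Victor Hugo', surface) AS score
FROM entities
WHERE 'Victor Hugo' <% surface
ORDER BY score DESC
LIMIT 100;
```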

The code to create the index is something like:

CREATE INDEX trgm_idx ON entities USING GIST (surface gist_trgm_ops);
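
Since the question is about GIN, a GIN version of the index would look roughly like the sketch below (the index name trgm_gin_idx is made up for the example). As the pg_trgm documentation describes it, both index types support the boolean similarity operators, while ordering by a distance operator can be served efficiently by GiST but not by GIN:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN variant of the index above (hypothetical name).
CREATE INDEX trgm_gin_idx ON entities USING GIN (surface gin_trgm_ops);

-- Nearest-neighbour style query: a GiST index can return rows in
-- distance order directly; with only a GIN index expect a full sort.
SELECT surface, surface <-> 'Victor Hugo' AS dist
FROM entities
ORDER BY dist
LIMIT 10;
```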

If you are wondering what the difference is between similarity and word_similarity, see https://dba.stackexchange.com/q/184716/6139

Or read the official documentation at https://www.postgresql.org/docs/current/pgtrgm.html

See also "Finding similar strings with PostgreSQL quickly".
