The PostgreSQL pg_trgm documentation states:

The pg_trgm module provides GiST and GIN index operator classes that allow you to create an index over a text column for the purpose of very fast similarity searches. These index types support the above-described similarity operators, and additionally support trigram-based index searches for LIKE, ILIKE, ~ and ~* queries.

and shows the following example:

SELECT t, word_similarity('word', t) AS sml
  FROM test_trgm
  WHERE 'word' <% t
  ORDER BY sml DESC, t;

Awesome!

However, when running the following query:

SELECT * 
FROM place 
WHERE word_similarity(place.name, '__SOME_STRING__') > 0.5

the index that I created is not used.

However, with ILIKE or the %> operator, the index does appear to be used. Why is the index not used with the word_similarity function?

2 Answers

According to this PostgreSQL forum response:

PostgreSQL doesn't use index scan with functions within WHERE clause. So you always need to use operators instead. You can try <% operator and pg_trgm.word_similarity_threshold variable:

=# SET pg_trgm.word_similarity_threshold TO 0.1;

=# SELECT name, popularity FROM temp.items3_v ,(values ('some phrase'::text)) consts(input) WHERE input <% name ORDER BY 2, input <<-> name;

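So the query can be rewritten to use the index as follows. The <% operator does the boolean filtering; <<-> returns a distance (a real), so it can be used for ordering but not as the WHERE condition itself. Note that, as in the documentation example, the search string goes on the left of <% and the indexed column on the right, so this evaluates word_similarity('__SOME_STRING__', place.name):

SET pg_trgm.word_similarity_threshold TO 0.5;
SELECT * 
FROM place 
WHERE '__SOME_STRING__' <% place.name
ORDER BY '__SOME_STRING__' <<-> place.name;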

Warning: the index is used with only one operator of the commutator pair. In my case it was used with <<-> but not with <->>. This Stack Overflow Q&A gives what looks like a reasonable explanation:

These are different operations, and only one of them is supported by the index.
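
To see which spellings the planner can match to the index, it helps to compare plans side by side. This is a minimal sketch, assuming the place table from the question has a text column name; the index name place_name_trgm_idx is hypothetical, and the comments describe what I would expect rather than guaranteed output, since the planner's choice also depends on table size and statistics.

```sql
-- Assumption: place.name is a text column; the index name is made up for this sketch.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX IF NOT EXISTS place_name_trgm_idx
    ON place USING GIST (name gist_trgm_ops);

SET pg_trgm.word_similarity_threshold TO 0.5;

-- Function form: a plain function call compared to a constant,
-- which the trigram index cannot be matched against.
EXPLAIN SELECT * FROM place
WHERE word_similarity('__SOME_STRING__', name) > 0.5;

-- Operator form, search string on the left of <%: index-eligible.
EXPLAIN SELECT * FROM place
WHERE '__SOME_STRING__' <% name;

-- Commuted spelling of the same condition: also index-eligible.
EXPLAIN SELECT * FROM place
WHERE name %> '__SOME_STRING__';

-- A different operation, word_similarity(name, '__SOME_STRING__'):
-- I would not expect the index to be used here.
EXPLAIN SELECT * FROM place
WHERE name <% '__SOME_STRING__';
```

On a very small table the planner may choose a sequential scan for all four, so the comparison is only meaningful with a realistic row count.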


With PostgreSQL 14, the similarity operator to use is %, because the WHERE clause only accepts boolean expressions.

The operators <<-> and <->> do not work as a WHERE condition: psql complains that they return a real number rather than a boolean, with the following message:

argument of WHERE must be type boolean, not type real
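
The return types can be checked directly with pg_typeof. A small sketch, assuming the pg_trgm extension is installed in the current database; the explicit ::text casts are there so that operator resolution is not ambiguous with other % and <-> operators:

```sql
-- Distance operators return real; match operators return boolean.
SELECT pg_typeof('word'::text <->  'two words'::text) AS similarity_distance,
       pg_typeof('word'::text <<-> 'two words'::text) AS word_similarity_distance,
       pg_typeof('word'::text %    'two words'::text) AS similarity_match,
       pg_typeof('word'::text <%   'two words'::text) AS word_similarity_match;
```

Only the last two columns report boolean, which is why only % and <% (and their commutators) can stand alone in a WHERE clause.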

So, instead of:

> EXPLAIN ANALYZE
  SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score 
  FROM entities 
  ORDER BY score DESC
  LIMIT 100;
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                        |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=488812.86..488824.53 rows=100 width=60) (actual time=10917.954..10922.771 rows=100 loops=1)                                          |
|   ->  Gather Merge  (cost=488812.86..1969466.87 rows=12690434 width=60) (actual time=10915.749..10920.559 rows=100 loops=1)                       |
|         Workers Planned: 2                                                                                                                        |
|         Workers Launched: 2                                                                                                                       |
|         ->  Sort  (cost=487812.84..503675.88 rows=6345217 width=60) (actual time=10899.929..10899.932 rows=75 loops=3)                            |
|               Sort Key: (similarity(surface, 'Victor Hugo'::text)) DESC                                                                           |
|               Sort Method: top-N heapsort  Memory: 48kB                                                                                           |
|               Worker 0:  Sort Method: top-N heapsort  Memory: 39kB                                                                                |
|               Worker 1:  Sort Method: top-N heapsort  Memory: 40kB                                                                                |
|               ->  Parallel Seq Scan on entities  (cost=0.00..245303.21 rows=6345217 width=60) (actual time=1.592..10271.425 rows=5075357 loops=3) |
| Planning Time: 0.105 ms                                                                                                                           |
| JIT:                                                                                                                                              |
|   Functions: 7                                                                                                                                    |
|   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                   |
|   Timing: Generation 0.513 ms, Inlining 0.000 ms, Optimization 0.455 ms, Emission 6.380 ms, Total 7.349 ms                                        |
| Execution Time: 10923.139 ms                                                                                                                      |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

To use the GIN or GiST index and speed up the query, you need to set a threshold and add a WHERE clause:

-- The % operator reads pg_trgm.similarity_threshold, not word_similarity_threshold.
SET pg_trgm.similarity_threshold TO 0.75;
SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score 
FROM entities 
WHERE surface % 'Victor Hugo'
ORDER BY score DESC 
LIMIT 100;
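
Each boolean pg_trgm operator reads its own threshold setting, so it is worth checking which one is in effect. A minimal sketch, assuming pg_trgm is installed; the first statement is only there because the pg_trgm.* settings become visible once the library has been loaded in the session:

```sql
-- Calling any pg_trgm function loads the library and registers its settings.
SELECT similarity('a', 'a');

SHOW pg_trgm.similarity_threshold;              -- read by %
SHOW pg_trgm.word_similarity_threshold;         -- read by <% and %>
SHOW pg_trgm.strict_word_similarity_threshold;  -- read by <<% and %>>

-- Raise the cutoff used by % for the current session only.
SET pg_trgm.similarity_threshold TO 0.75;
```

Setting the wrong variable is silent: the query still runs, but the operator keeps filtering with its own (possibly default) threshold.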

Here is the query plan, which uses the index called trgm_idx:

> EXPLAIN ANALYZE
  SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score 
  FROM entities 
  WHERE surface % 'Victor Hugo'
  ORDER BY score DESC
  LIMIT 100;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                                |
|-------------------------------------------------------------------------------------------------------------------------------------------|
| Limit  (cost=5798.57..5798.82 rows=100 width=60) (actual time=2338.179..2338.189 rows=100 loops=1)                                        |
|   ->  Sort  (cost=5798.57..5802.38 rows=1522 width=60) (actual time=2338.178..2338.182 rows=100 loops=1)                                  |
|         Sort Key: (similarity(surface, 'Victor Hugo'::text)) DESC                                                                         |
|         Sort Method: top-N heapsort  Memory: 48kB                                                                                         |
|         ->  Bitmap Heap Scan on entities  (cost=88.21..5740.40 rows=1522 width=60) (actual time=2287.892..2336.556 rows=13828 loops=1)    |
|               Recheck Cond: (surface % 'Victor Hugo'::text)                                                                               |
|               Heap Blocks: exact=9422                                                                                                     |
|               ->  Bitmap Index Scan on trgm_idx  (cost=0.00..87.83 rows=1522 width=0) (actual time=2286.962..2286.963 rows=13828 loops=1) |
|                     Index Cond: (surface % 'Victor Hugo'::text)                                                                           |
| Planning Time: 0.679 ms                                                                                                                   |
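| Execution Time: 2338.234 ms                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------+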

The code to create the index is something like:

CREATE INDEX trgm_idx ON entities USING GIST (surface gist_trgm_ops);
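
Two variations that may be worth trying, sketched under the assumption that the entities table and surface column are as above; the index name trgm_gin_idx is hypothetical. The second query follows the nearest-neighbour pattern from the pg_trgm documentation, which states that it can be served by a GiST index but not by a GIN index:

```sql
-- GIN alternative to the GiST index (hypothetical index name).
CREATE INDEX trgm_gin_idx ON entities USING GIN (surface gin_trgm_ops);

-- Nearest-neighbour form: order by trigram distance and let LIMIT cut the result,
-- so no similarity threshold is needed.
SELECT wikidata_url, surface, surface <-> 'Victor Hugo' AS dist
FROM entities
ORDER BY dist
LIMIT 100;
```

Here dist is 1 minus similarity, so ascending order returns the closest matches first.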

If you are wondering what the difference is between similarity and word_similarity, see https://dba.stackexchange.com/q/184716/6139

Or read the official documentation at https://www.postgresql.org/docs/current/pgtrgm.html
