With PostgreSQL 14, use the similarity operator % in the WHERE clause, because a WHERE clause only accepts boolean expressions.
The operators <<-> and <->> do not work in a WHERE clause: they return a real number (a distance), not a boolean, so PostgreSQL rejects the query with the following message:
argument of WHERE must be type boolean, not type real
So, instead of a query that only sorts on similarity and therefore has to scan the whole table:
> EXPLAIN ANALYZE
SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score
FROM entities
ORDER BY score DESC
LIMIT 100;
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Limit (cost=488812.86..488824.53 rows=100 width=60) (actual time=10917.954..10922.771 rows=100 loops=1) |
| -> Gather Merge (cost=488812.86..1969466.87 rows=12690434 width=60) (actual time=10915.749..10920.559 rows=100 loops=1) |
| Workers Planned: 2 |
| Workers Launched: 2 |
| -> Sort (cost=487812.84..503675.88 rows=6345217 width=60) (actual time=10899.929..10899.932 rows=75 loops=3) |
| Sort Key: (similarity(surface, 'Victor Hugo'::text)) DESC |
| Sort Method: top-N heapsort Memory: 48kB |
| Worker 0: Sort Method: top-N heapsort Memory: 39kB |
| Worker 1: Sort Method: top-N heapsort Memory: 40kB |
| -> Parallel Seq Scan on entities (cost=0.00..245303.21 rows=6345217 width=60) (actual time=1.592..10271.425 rows=5075357 loops=3) |
| Planning Time: 0.105 ms |
| JIT: |
| Functions: 7 |
| Options: Inlining false, Optimization false, Expressions true, Deforming true |
| Timing: Generation 0.513 ms, Inlining 0.000 ms, Optimization 0.455 ms, Emission 6.380 ms, Total 7.349 ms |
| Execution Time: 10923.139 ms |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
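Or something similar. That plan is a parallel sequential scan over the whole table and takes about 10.9 seconds.
To use the GIN or GiST index and speed up the query, you need both a similarity threshold and a WHERE clause that filters with %: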
SET pg_trgm.similarity_threshold TO 0.75;
SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score
FROM entities
WHERE surface % 'Victor Hugo'
ORDER BY score DESC
LIMIT 100;
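As a quick sanity check before running the filtered query, the threshold that % compares against can be inspected and changed per session. This is a minimal sketch, assuming the pg_trgm extension is already installed; pg_trgm.similarity_threshold is the setting read by %, and its documented default is 0.3:

```sql
-- Show the threshold currently used by the % operator.
SHOW pg_trgm.similarity_threshold;

-- Raise it for the current session only.
SET pg_trgm.similarity_threshold TO 0.75;

-- similarity() returns the raw score, % returns the boolean verdict.
-- Identical strings score 1, so "matches" is true here.
SELECT similarity('Victor Hugo', 'Victor Hugo') AS score,
       'Victor Hugo' % 'Victor Hugo'            AS matches;
```

A higher threshold lets the index discard more candidates, at the price of dropping weaker matches from the result.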
Here is the query plan, which uses the index called trgm_idx and runs in about 2.3 seconds:
> EXPLAIN ANALYZE
SELECT wikidata_url, surface, similarity(surface, 'Victor Hugo') AS score
FROM entities
WHERE surface % 'Victor Hugo'
ORDER BY score DESC
LIMIT 100;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------------------------------|
| Limit (cost=5798.57..5798.82 rows=100 width=60) (actual time=2338.179..2338.189 rows=100 loops=1) |
| -> Sort (cost=5798.57..5802.38 rows=1522 width=60) (actual time=2338.178..2338.182 rows=100 loops=1) |
| Sort Key: (similarity(surface, 'Victor Hugo'::text)) DESC |
| Sort Method: top-N heapsort Memory: 48kB |
| -> Bitmap Heap Scan on entities (cost=88.21..5740.40 rows=1522 width=60) (actual time=2287.892..2336.556 rows=13828 loops=1) |
| Recheck Cond: (surface % 'Victor Hugo'::text) |
| Heap Blocks: exact=9422 |
| -> Bitmap Index Scan on trgm_idx (cost=0.00..87.83 rows=1522 width=0) (actual time=2286.962..2286.963 rows=13828 loops=1) |
| Index Cond: (surface % 'Victor Hugo'::text) |
| Planning Time: 0.679 ms |
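| Execution Time: 2338.234 ms |
+-------------------------------------------------------------------------------------------------------------------------------------------+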
The code to create the index is something like:
CREATE INDEX trgm_idx ON entities USING GIST (surface gist_trgm_ops);
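Both trigram operator classes come from the pg_trgm extension, so it has to be installed in the database before the index can be built. The sketch below also shows the GIN alternative; trgm_gin_idx is a hypothetical index name, and which index type is faster on this table is something to measure rather than assume:

```sql
-- Install the extension once per database (no-op if already present).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN alternative to the GiST index above (hypothetical name).
CREATE INDEX trgm_gin_idx ON entities USING GIN (surface gin_trgm_ops);
```

According to the pg_trgm documentation, both index types can serve % in a WHERE clause, but only GiST can serve an ORDER BY on a distance operator such as <->.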
If you wonder about the difference between similarity and word_similarity, see https://dba.stackexchange.com/q/184716/6139
or read the official documentation at https://www.postgresql.org/docs/current/pgtrgm.html
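For completeness, here is a sketch of the word_similarity variant of the same query, assuming the same entities table. The boolean operator for word similarity is <% (search term on the left), and it is the one that reads pg_trgm.word_similarity_threshold. I have not benchmarked this version, so treat it as a starting point:

```sql
-- Threshold used by <% and %>, not by %.
SET pg_trgm.word_similarity_threshold TO 0.75;

SELECT wikidata_url, surface,
       word_similarity('Victor Hugo', surface) AS score
FROM entities
WHERE 'Victor Hugo' <% surface   -- boolean, so it is valid in WHERE
ORDER BY score DESC
LIMIT 100;
```

The distance operators <<-> and <->> return one minus the word_similarity value, which is why they fit in ORDER BY or the SELECT list but not in WHERE.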