Postgres pg_trgm GIN index ignored in a specific join

Question

I have a table item with several text columns (name, unique_attr, category, etc.), and I've indexed each of them with a GIN (gin_trgm_ops) index for faster ilike queries. This works: even with a join to the table inventory_membership, the indexes are used and they speed up execution. Output of my explain:

explain analyze select i.* from item i
    join inventory_membership im on im.inventory_id = i.inventory_id
    where i.name ilike '%blu%' or unique_attr ilike '%blu%' or category ilike '%blu%'
    or brand ilike '%blu%';

Hash Join  (cost=98.64..4584.98 rows=87302 width=478) (actual time=4.258..30.393 rows=57584 loops=1)
  Hash Cond: (i.inventory_id = im.inventory_id)
  ->  Bitmap Heap Scan on item i  (cost=95.45..3584.23 rows=4982 width=478) (actual time=3.706..10.529 rows=3340 loops=1)
        Recheck Cond: ((name ~~* '%blu%'::text) OR (unique_attr ~~* '%blu%'::text) OR (category ~~* '%blu%'::text) OR (brand ~~* '%blu%'::text))
        Heap Blocks: exact=715
        ->  BitmapOr  (cost=95.45..95.45 rows=5130 width=0) (actual time=3.622..3.622 rows=0 loops=1)
              ->  Bitmap Index Scan on item_name_idx  (cost=0.00..42.97 rows=3596 width=0) (actual time=1.612..1.612 rows=3160 loops=1)
                    Index Cond: (name ~~* '%blu%'::text)
              ->  Bitmap Index Scan on item_unique_attr_idx  (cost=0.00..12.01 rows=1 width=0) (actual time=0.586..0.586 rows=32 loops=1)
                    Index Cond: (unique_attr ~~* '%blu%'::text)
              ->  Bitmap Index Scan on item_category_idx  (cost=0.00..22.78 rows=1437 width=0) (actual time=0.888..0.888 rows=1394 loops=1)
                    Index Cond: (category ~~* '%blu%'::text)
              ->  Bitmap Index Scan on item_brand_idx  (cost=0.00..12.72 rows=96 width=0) (actual time=0.532..0.532 rows=42 loops=1)
                    Index Cond: (brand ~~* '%blu%'::text)
  ->  Hash  (cost=1.97..1.97 rows=97 width=4) (actual time=0.059..0.060 rows=87 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on inventory_membership im  (cost=0.00..1.97 rows=97 width=4) (actual time=0.010..0.032 rows=87 loops=1)
Planning Time: 0.924 ms
Execution Time: 42.093 ms

We can see that the item_name_idx, item_unique_attr_idx, item_category_idx and item_brand_idx GIN indexes are all used to evaluate the ilike conditions. Great.
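
For readers who want to reproduce the setup: the index definitions are not shown in this question, so the following is only a sketch of what they presumably look like. The index names are taken from the plan above; the exact DDL is an assumption.

```sql
-- Assumed DDL, reconstructed from the index names in the plan above.
-- pg_trgm provides the gin_trgm_ops operator class used for ilike '%...%'.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX IF NOT EXISTS item_name_idx
    ON item USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS item_unique_attr_idx
    ON item USING gin (unique_attr gin_trgm_ops);
CREATE INDEX IF NOT EXISTS item_category_idx
    ON item USING gin (category gin_trgm_ops);
CREATE INDEX IF NOT EXISTS item_brand_idx
    ON item USING gin (brand gin_trgm_ops);
```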

However, when I join another table (inventory, which only has id and name columns), the GIN indexes are no longer used. Explain:

explain analyze select i.* from item i
    join inventory inv on inv.id = i.inventory_id
    join inventory_membership im on im.inventory_id = i.inventory_id
    where i.name ilike '%blu%' or unique_attr ilike '%blu%' or category ilike '%blu%'
    or brand ilike '%blu%';

Hash Join  (cost=4.67..1172.61 rows=60407 width=478) (actual time=0.775..121.787 rows=57584 loops=1)
  Hash Cond: (inv.id = im.inventory_id)
  ->  Merge Join  (cost=1.49..440.81 rows=4982 width=482) (actual time=0.111..101.857 rows=3340 loops=1)
        Merge Cond: (i.inventory_id = inv.id)
        ->  Index Scan using item_inventory_id_idx on item i  (cost=0.29..13946.60 rows=4982 width=478) (actual time=0.085..99.857 rows=3340 loops=1)
              Filter: ((name ~~* '%blu%'::text) OR (unique_attr ~~* '%blu%'::text) OR (category ~~* '%blu%'::text) OR (brand ~~* '%blu%'::text))
              Rows Removed by Filter: 34858
        ->  Sort  (cost=1.20..1.22 rows=8 width=4) (actual time=0.020..0.025 rows=8 loops=1)
              Sort Key: inv.id
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on inventory inv  (cost=0.00..1.08 rows=8 width=4) (actual time=0.006..0.009 rows=8 loops=1)
  ->  Hash  (cost=1.97..1.97 rows=97 width=4) (actual time=0.650..0.651 rows=87 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on inventory_membership im  (cost=0.00..1.97 rows=97 width=4) (actual time=0.005..0.028 rows=87 loops=1)
Planning Time: 7.193 ms
Execution Time: 132.427 ms

As you can see, the GIN indexes are gone, and the only index the plan uses is item_inventory_id_idx, which is the regular BTREE index on the foreign key. The execution time also went from about 42 ms to about 132 ms. Why?

How many rows are there in each table? If the number of rows in inventory is low, that might explain why a merge join is chosen instead of a hash join, which would be why the execution time is so high. Also, please add the query plans as text; right now we cannot see the actual time for the Seq Scan.
– Ruben Helsloot, Commented Aug 19, 2020 at 12:56
@RubenHelsloot Ah, correct, there are only a few inventories, 8 or 9 in total, with around 4k items. As for the query plans, I'll edit the post in just a sec.
– Ognjen Mišić, Commented Aug 19, 2020 at 13:37
Just as an example, a similar thing happened here. The only difference is that they had a larger discrepancy in the expected number of rows. However, putting the item query and its where clause in a subquery or CTE might help for you too.
– Ruben Helsloot, Commented Aug 19, 2020 at 14:00
Ah, so you mean I don't join the inventory but subquery it. It might work, I'll try it out. Also a correction to the above: I have around 40k items, not 4k :D
– Ognjen Mišić, Commented Aug 19, 2020 at 14:05
@RubenHelsloot As I'm mainly interested in the name from inventory, I've indexed it now as well (a regular btree is ok?), and explain analyze select i.*, (select name as inventoryName from inventory where id = i.inventory_id) from item i... has a slightly longer execution time (168.170 ms) but a planning time of 0.974 ms, compared to the old 132.427 ms execution and 7.193 ms planning times. What does this mean? Also, I finally see my GIN indexes used on my items!
– Ognjen Mišić, Commented Aug 19, 2020 at 14:14

Ruben Helsloot · Accepted Answer · 2020-08-19 14:39:35Z


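You note that you are mostly interested in the inventory name, and that the inventory table has only 8 rows. Those 8 rows are probably why the query planner prefers a merge join over the hash join, which tends to do better when both tables are large. A merge join needs its input sorted on inventory_id, which is exactly what the btree index item_inventory_id_idx provides, so the planner reads item through that index and applies the ilike conditions as a plain filter instead of using your GIN indexes, because it estimated that to be cheaper. The estimate did not hold up: the merge join is costed at about 440 although the full index scan is costed at about 13947, which suggests the planner expected to stop reading the index early, while the actual run visited every row (3340 kept, 34858 removed by the filter).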
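
One way to check this explanation, sketched here on the assumption that you can run ad hoc queries against the same database, is to switch merge joins off for a single transaction and compare the plan. enable_mergejoin is a standard planner setting intended for this kind of diagnosis, not something to leave changed in production.

```sql
BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
-- This is a diagnostic switch, not a fix.
SET LOCAL enable_mergejoin = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT i.*
FROM item i
JOIN inventory inv ON inv.id = i.inventory_id
JOIN inventory_membership im ON im.inventory_id = i.inventory_id
WHERE i.name ILIKE '%blu%' OR unique_attr ILIKE '%blu%' OR category ILIKE '%blu%'
   OR brand ILIKE '%blu%';

ROLLBACK;
```

If the Bitmap Index Scans on the GIN indexes reappear with merge joins disabled, the merge join costing is what pushes the planner away from them.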

Without access to the data I cannot tell which of the following will be fastest, but there are several things you can try. The first, which you already tried, is to fetch the inventory name in a scalar subquery:

SELECT i.*, (select name from inventory where id = i.inventory_id) as inventoryName
FROM item i
JOIN inventory_membership im ON im.inventory_id = i.inventory_id
WHERE i.name ilike '%blu%' or unique_attr ilike '%blu%' or category ilike '%blu%' 
     or brand ilike '%blu%';

But that means the inner select is executed about 57k times, once for each result row. The second option is to keep the query you had, but check whether changing i.inventory_id to inv.id in the join condition on inventory_membership makes any difference:

SELECT i.*, inv.name as inventoryName
FROM item i
JOIN inventory inv ON inv.id = i.inventory_id
JOIN inventory_membership im ON im.inventory_id = inv.id -- <- this changed
WHERE i.name ilike '%blu%' or unique_attr ilike '%blu%' or category ilike '%blu%' 
     or brand ilike '%blu%';

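Finally, as suggested in the question linked in the comments, you can force the first query to be executed before the inventory name is fetched, using either a CTE (which acts as an optimization fence only before PostgreSQL 12, or when declared MATERIALIZED) or a subquery with OFFSET 0.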

WITH my_items AS (
  SELECT i.*
  FROM item i
  JOIN inventory_membership im ON im.inventory_id = i.inventory_id
  WHERE i.name ilike '%blu%' or unique_attr ilike '%blu%' or category ilike '%blu%' 
       or brand ilike '%blu%'
)
SELECT i.*, inv.name as inventoryName
FROM my_items i
JOIN inventory inv ON inv.id = i.inventory_id

or

SELECT i.*, inv.name as inventoryName
FROM (
  SELECT i.*
  FROM item i
  JOIN inventory_membership im ON im.inventory_id = i.inventory_id
  WHERE i.name ilike '%blu%' or unique_attr ilike '%blu%' or category ilike '%blu%' 
       or brand ilike '%blu%'
  OFFSET 0 -- <- this forces the subquery to be planned separately from the rest of the query
) i
JOIN inventory inv ON inv.id = i.inventory_id
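
A note on the CTE variant: from PostgreSQL 12 onwards, a plain CTE that is referenced only once is inlined into the outer query, so it no longer keeps the planner from rearranging the joins. The sketch below assumes you are on version 12 or later (the question does not say which version is in use) and uses the MATERIALIZED keyword to get the fencing behaviour back. On older versions this syntax is rejected.

```sql
-- Assumes PostgreSQL 12+. MATERIALIZED makes the CTE an optimization fence,
-- so the filtered item rows are computed first, on their own.
WITH my_items AS MATERIALIZED (
  SELECT i.*
  FROM item i
  JOIN inventory_membership im ON im.inventory_id = i.inventory_id
  WHERE i.name ILIKE '%blu%' OR unique_attr ILIKE '%blu%' OR category ILIKE '%blu%'
     OR brand ILIKE '%blu%'
)
SELECT i.*, inv.name AS inventoryName
FROM my_items i
JOIN inventory inv ON inv.id = i.inventory_id;
```

Running this under EXPLAIN ANALYZE should show whether the GIN bitmap scans are used inside the CTE; I have not measured it against your data.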


3 Comments

Ognjen Mišić Over a year ago

I really appreciate the effort and the help! What you said about "but see if changing i.inventory_id to inv.id in inventory_membership changes anything." - it does not unfortunately - I've tried it before. The "OFFSET 0" with the subquery actually forced the GIN indexes to be used and brought the execution time down to 73k ms! (planning is still around 1k). I'll soon get a bigger dataset with around 400k items and will do another round of EXPLAINs, thank you very much for the assistance so far!

Ruben Helsloot Over a year ago

Happy to help! One more note; those query plans are not measured in k's, but given in milliseconds with 3 decimals, so it's 1ms planning and 73ms execution time

Ognjen Mišić Over a year ago

Yeah yeah I'm aware, I just got sick of writing down those decimals and my brain switched gears :D
