
I have a TEXT keyvalues column in Postgres:

select * from test5 limit 5;

 id |                      keyvalues
----+------------------------------------------------------
  1 | ^ first 1 | second 3
  2 | ^ first 1 | second 2 ^ first 2 | second 3
  3 | ^ first 1 | second 2 | second 3
  4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
  5 | ^ first 2 | second 3 ^ first 1 | second 3

My queries must exclude the ^ character from the middle of the match, so I'm using regular expressions:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
   ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
               ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                     Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                     Rows Removed by Filter: 1666668
 Planning Time: 0.068 ms
 Execution Time: 7337.184 ms

The query works (zero rows match), but is way too slow at > 7 seconds.

I thought indexing with trigrams would help, but no luck:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
   ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
         Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
         Rows Removed by Index Recheck: 5000005
         Heap Blocks: exact=47620
         ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
               Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
 Planning Time: 0.412 ms
 Execution Time: 23734.722 ms

The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately there's no second 0 string anywhere, and be super fast.

(Motivation: I want to avoid normalizing the keyvalues into another table, so I'm looking to encode the matching logic in a single TEXT field using text indexing and regexps instead. The logic works, but is too slow, as is JSONB.)

Comments:
  • You really should be storing those as individual records in a table instead of trying to combine them like that. Or at the very least, doesn't postgres support array types? Commented Jun 8, 2019 at 11:25
  • @Shawn No, Postgres doesn't support efficient ILIKE over TEXT arrays, only via the slow unnest() (sequential scan) AFAIK; see the sketch after these comments. But if you know a way, let me know! Commented Jun 8, 2019 at 11:29
  • Then try (just as a test) normalizing this table and check how it performs. I wouldn't be surprised if it's faster than your regexp solution. Commented Jun 8, 2019 at 17:16
  • @krokodilko But that would defeat the purpose of this question. Btw you can see such results in the hyperlinks, if you like. Commented Jun 8, 2019 at 18:35
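As a rough sketch of the unnest() point above, for illustration only (the test5_arr table and groups column are hypothetical, with one array element per ^-group): a GIN index on a text[] column only accelerates equality/containment operators, not ILIKE on the elements, so the match has to go through unnest() and ends up scanning every row.

create table test5_arr (id int primary key, groups text[]);  -- hypothetical layout

select count(*)
from test5_arr t
where exists (select 1
              from unnest(t.groups) g                 -- expand the array for each row
              where g ilike 'first 1%second 0%');     -- element-level match, no index support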

2 Answers


According to the OP, the correct answer was given here on DBA.SE by user @jjanes:

I expected the trigram index to figure out immediately there's no second 0 string anywhere

'second' and '0' are separate words, so it cannot detect their joint absence as such. It seems like it could detect the absence of ' 0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems pertinent:

     * Note: Using again the example "foo bar", we will not consider the
     * trigram "  b", though this trigram would be found by the trigram
     * extraction code.  Since we will find " ba", it doesn't seem worth
     * trying to hack the algorithm to generate the additional trigram.

Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a", either, so it just misses that opportunity.

Even if it were not for this limitation, your approach seems extremely fragile.
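As an illustration (not part of the quoted answer), running show_trgm() on just that part of the pattern makes the per-word extraction visible; the output should look roughly like this:

select show_trgm('second 0');
                    show_trgm
--------------------------------------------------
 {"  0","  s"," 0 "," se",con,eco,"nd ",ond,sec}

No trigram spans the space between "second" and "0", so the index has nothing that ties the two words together.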



As you saw, this will not work well with trigrams. Trigrams never cross a word (space) boundary, so if every row contains the same set of words, the index matches every row.

This may make things more clear:

with data as (select * from (values ('^ first 1 | second 3'), 
                                    ('^ first 1 | second 2 ^ first 2 | second 3'), 
                                    ('^ first 1 | second 2 | second 3'), 
                                    ('^ first 2 | second 3 ^ first 1 | second 2 | second 2'), 
                                    ('^ first 2 | second 3 ^ first 1 | second 3')
                             ) v(keyvalues)
)
select keyvalues, show_trgm(keyvalues) from data;
                      keyvalues                       |                                               show_trgm
------------------------------------------------------+-------------------------------------------------------------------------------------------------------
 ^ first 1 | second 3                                 | {"  1","  3","  f","  s"," 1 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 1 | second 2 ^ first 2 | second 3            | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 1 | second 2 | second 3                      | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 2 | second 3 ^ first 1 | second 2 | second 2 | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 2 | second 3 ^ first 1 | second 3            | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}

Could you use a partial index to exclude the rows with ^ in the middle?
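A rough sketch of that idea, for illustration (the reading of "^ in the middle" as "more than one group per row" is an assumption, and the index name is arbitrary); note that a query must repeat the partial-index predicate, otherwise the planner cannot use the index:

create index test5_single_group_idx on test5 using gin (keyvalues gin_trgm_ops)
where keyvalues not like '_%^%';   -- only % and _ are special in LIKE, so ^ is literal;
                                   -- this keeps rows with no ^ after the first character

select count(*) from test5
where keyvalues not like '_%^%'
  and keyvalues ~* '\^ first 1[^\^]+second 0';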
