0

We have a musicians table containing records with multiple string fields, say:

  • "Jimi", "Hendrix", "Guitar"
  • "Phil", "Collins", "Drums"
  • "Sting", "", "Bass"
  • "Ringo", "Starr", "Drums"
  • "Paul", "McCartney", "Bass"

I want to pass postgres a long string, say:

"It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage."

and i want to get returned the fields that have any matches - preferably in order of the most matches first:

  • "Jimi", "Hendrix", "Guitar"
  • "Phil", "Collins", "Drums"
  • "Ringo", "Starr", "Drums"

because i need the search to be case insensitive, i'm constructing a query like this...

select * from musicians where lowercase_string like '%'||firstname||'%' or  lowercase_string like '%'||lastname||'%' or lowercase_string like '%'||instrument||'%'

and then looping through (in ruby in my case) to capture the result with the most matches.

this is however very slow in the sql stage (1 minute+).

i've tried adding lower-case GIN index using pg_trgm as suggested here - but it's not helping - presumably because the like query is back to front?

Thanks!

7
  • 1
    What if the "long string" contains drum, instead of drums? Is full text search acceptable? (It has a similar, but more configurable ranking system). Commented Mar 9, 2017 at 16:29
  • i think that would be fine - false positives could be filtered later - but how would full text search work when the "text" is outside of postgres? maybe i've misunderstood how it works? Commented Mar 9, 2017 at 16:36
  • What do you mean by outside of postgres? In your example lowercase_string seems to be "inside" postgres :) -- Obviously, just use/bind it as a query parameter (which depends on your client). With prepared statements, it seems something like ORDER BY ts_rank(to_tsvector(?), plainto_tsquery(firstname) || plainto_tsquery(lastname) || ...) in most DBALs. Commented Mar 9, 2017 at 16:43
  • And you could filter the results with FTS too, there is no need to filter "false positives" later. Commented Mar 9, 2017 at 16:45
  • Maybe worth to mention that neither FTS, nor pg_trgm+LIKE will use any index, because of your "reverse" matching. -- But if you don't need to loop through the results in ruby, you'll increase the performance anyway. And you could still choose the simple dictionary in FTS to make drum & drums distinct. Commented Mar 9, 2017 at 16:49

2 Answers 2

4

With my testing, it seems that no trigram index could help your query at all. And no other index type could possibly speed up an (I)LIKE / FTS based search.

I should mention that all of the queries below use the trigram indexes, when they are queried "reversed": when the table contains the document (which is indexed), and your parameter is the query. The (I)LIKE variant variant f.ex. 2-3 times faster with it.

These the queries I've tested:

select *
from   musicians
where  :input_string ilike '%' || firstname  || '%'
or     :input_string ilike '%' || lastname   || '%'
or     :input_string ilike '%' || instrument || '%'

At first, FTS seemed a great idea, but my testing shows that even without ranking, it is 60-100 times slower than the (I)LIKE variant. (So even, when you don't have to post-process results with these methods, these are not worth it).

select *
from   musicians
where  to_tsvector(:input_string) @@ (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))

However, ORDER BY rank doesn't slow down that much further: it is 70-120 times slower than the (I)LIKE variant.

select   *
from     musicians
where    to_tsvector(:input_string) @@ (plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))
order by ts_rank(to_tsvector(:input_string), plainto_tsquery(firstname) || plainto_tsquery(lastname) || plainto_tsquery(lastname))

Then, for a last effort, I tried the (fairly new) "word similarity" operators of the trigram module: <% and %> (available from PostgreSQL 9.6).

select *
from   musicians
where  :input_string %> firstname
or     :input_string %> lastname
or     :input_string %> instrument

select *
from   musicians
where  firstname  <% :input_string
or     lastname   <% :input_string
or     instrument <% :input_string

These were somewhat faster then FTS: around 50-70 times slower than the (I)LIKE variant.

(Partially working) rextester: it is run against PostgreSQL 9.5, so the 9.6 operators obviously won't run here.

Update: IF full word match is enough for you, you can actually reverse your query, to be able to use indexes. You'll need to "parse" your query (aka. "long string") though:

with long_string(ls) as (
  values (:input_string)
),
words(word) as (
  select s
  from   long_string, regexp_split_to_table(ls, '[^[:alnum:]]+') s
  where  s <> ''
)
select   musicians.*
from     musicians, words
where    firstname  ilike word
or       lastname   ilike word
or       instrument ilike word
group by musicians.id

Note: I parsed the query for every complete word. You can have some other logic there, or it can even be parsed in client side.

The default, btree index shines here, as it is much faster than the trigram index with (I)LIKE (we won't need them anyway, as we are looking for complete word match here):

with long_string(ls) as (
  values (:input_string)
),
words(word) as (
  select s
  from   long_string, regexp_split_to_table(lower(ls), '[^[:alnum:]]+') s
  where  s <> ''
)
select   musicians.*
from     musicians, words
where    lower(firstname)  = word
or       lower(lastname)   = word
or       lower(instrument) = word
group by musicians.id

http://rextester.com/PSABJ6745

You could even get the match count with something like

sum((lower(firstname)  = word)::int
  + (lower(lastname)   = word)::int
  + (lower(instrument) = word)::int)
Sign up to request clarification or add additional context in comments.

2 Comments

thanks @pozs - my experiments came to the same conclusion: that indexes cannot help me here and that LIKE is the fastest of the slow options...
@user1051849 If word matches enough for you, you could try to "parse" the query itself (f.ex. split into words by whitespaces) & then try to match word-by-word, like firstname ilike 'It' or lastname ilike 'is'. Or, at this point, you could use the default btree` indexes & the equality operator = for filtering.
2

The ilike option with match ordering:

with long_string (ls) as (values
    ('It is known that Jimi liked to set light to his guitar and smash up all the drums while on stage.')
)
select musicians.*, matches
from
    musicians
    cross join
    long_string
    cross join lateral
    (select
        (ls ilike format ('%%%s%%', first_name) and first_name != '')::int +
        (ls ilike format ('%%%s%%', last_name) and last_name != '')::int +
        (ls ilike format ('%%%s%%', instrument) and instrument != '')::int 
        as matches
    ) m
where matches > 0
order by matches desc
;
 first_name | last_name | instrument | matches 
------------+-----------+------------+---------
 Jimi       | Hendrix   | Guitar     |       2
 Phil       | Collins   | Drums      |       1
 Ringo      | Starr     | Drums      |       1

1 Comment

+1, the first_name != '' parts are a good catch, and in total, this is a great idea for reducing post-processing costs. It will not, however, reduce the cost of the query itself.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.