7

I have a table with book titles and I want to select books that have title matching a regexp and to order results by the position of the regexp match in title.

It's easy for a single-word searches. E.g.

TABLE book
id   title
1    The Sun
2    The Dead Sun
3    Sun Kissed

I'm going to put .* between words in client's search term before sending query to DB, so I'd write SQL with prepared regexps here.

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*sun.*'
    ORDER BY COALESCE(NULLIF(position('sun' in book.title), 0), 999999) ASC;

RESULT
id   title
3    Sun Kissed
1    The Sun
2    The Dead Sun

But if search term has more than one word I want to match titles that have all words from search term with anything between them, and sort by the position like before, so I need a function that returns a position of regexp, I didn't find an appropriate one in official PostgreSQL docs.

TABLE books
id   title
4    Deep Space Endeavor
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice
6    Deep Black: Space Espionage and National Security

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*deep.*space.*'
    ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);

DESIRED RESULT
id   title
4    Deep Space Endeavor
6    Deep Black: Space Espionage and National Security
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice

I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?

2
  • Why is row 4 before row 6 in you last result? Both match at position 0. Arbitrary? Or are there more criteria? Commented Jan 14, 2014 at 1:23
  • @ErwinBrandstetter Ideally I'd like longer (more characters) matches to have less "rank", but for now it's not the main problem. Commented Jan 14, 2014 at 1:30

2 Answers 2

4

One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:

SELECT id, title
FROM   book
WHERE  title ILIKE '%deep%space%'
ORDER  BY length(regexp_replace(title, 'deep.*space.*', '','i'));

Using ILIKE in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace() function ('i'), to make it case insensitive.

Alternatives

As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST).

SELECT id, title
      ,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
      ,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1

      ,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
      ,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2

      ,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
      ,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3

      ,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
      ,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM   book
ORDER  BY title ILIKE '%deep%space%' DESC NULLS LAST
         ,length(regexp_replace(title, 'deep.*space.*', '','i'));

You can find documentation for all of the above in the manual here and here.

-> SQLfiddle demonstrating all.

Sign up to request clarification or add additional context in comments.

7 Comments

Nice! Really simple and good solution. You say there can be many other solutions, could you please show me any? I didn't see this solution and can't imagine others even now :)
How can I remove WHERE clause and have the same sorting? I mean I want to get all titles but titles having no match at the end? With this length() I will get very short titles having no match first.
I just edited the question because I've found that I don't have ability to create procedural functions at all (thanks heroku).
ORDER BY title ILIKE '%deep%space%' DESC, length(regexp_replace(title, 'deep.*space.*', '','i')) looks like a solution
Yes, your answer helped a lot. Finally I'm taking the length of string before match and the length of the match, and ordering based on this two values, Thank You very much for inspiration!
|
1

Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:

strpos(input, (regexp_match(input, pattern, 'i'))[1]);

Or in this case:

SELECT   id, title
FROM     book
ORDER BY strpos(book.title, (regexp_match(book.title, '.*deep.*space.*', 'i'))[1]);

However, there are few caveats:

  1. this is not very efficient as it will scan the input string twice.

  2. this will ignore lookaround (lookbehind, lookahead) constraints, since the literal match can appear multiple times, before the pattern match. e.g: for the input 'aba' and pattern '(?<=b)a', strpos will return 1 (for the 1st 'a') although the actual position should be 3 (for the 2nd 'a').

BTW, you should probably use a greedy quantifier and narrow your character class as much as you can instead of .* to increase performance (e.g 'deep [\w\s]*? space')

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.