How to get position of regexp match in string in PostgreSQL?

Question

I have a table with book titles and I want to select books that have title matching a regexp and to order results by the position of the regexp match in title.

It's easy for a single-word searches. E.g.

TABLE book
id   title
1    The Sun
2    The Dead Sun
3    Sun Kissed

I'm going to put .* between words in client's search term before sending query to DB, so I'd write SQL with prepared regexps here.

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*sun.*'
    ORDER BY COALESCE(NULLIF(position('sun' in book.title), 0), 999999) ASC;

RESULT
id   title
3    Sun Kissed
1    The Sun
2    The Dead Sun

But if search term has more than one word I want to match titles that have all words from search term with anything between them, and sort by the position like before, so I need a function that returns a position of regexp, I didn't find an appropriate one in official PostgreSQL docs.

TABLE books
id   title
4    Deep Space Endeavor
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice
6    Deep Black: Space Espionage and National Security

SELECT book.id, book.title FROM book
    WHERE book.title ~* '.*deep.*space.*'
    ORDER BY ???REGEXP_POSITION_FUNCTION???('.*deep.*space.*' in book.title);

DESIRED RESULT
id   title
4    Deep Space Endeavor
6    Deep Black: Space Espionage and National Security
5    Star Trek: Deep Space Nine: The Never Ending Sacrifice

I didn't find any function similar to ???REGEXP_POSITION_FUNCTION???, do you have any ideas?

Why is row 4 before row 6 in you last result? Both match at position 0. Arbitrary? Or are there more criteria? — Erwin Brandstetter
– Erwin Brandstetter, Commented Jan 14, 2014 at 1:23
@ErwinBrandstetter Ideally I'd like longer (more characters) matches to have less "rank", but for now it's not the main problem. — wobmene
– wobmene, Commented Jan 14, 2014 at 1:30

Community · Accepted Answer · 2020-06-20 09:12:55Z

4

One way (of many) to do this: Remove the rest of the string beginning at the match and measure the length of the truncated string:

SELECT id, title
FROM   book
WHERE  title ILIKE '%deep%space%'
ORDER  BY length(regexp_replace(title, 'deep.*space.*', '','i'));

Using ILIKE in the WHERE clause, since that is typically faster (and does the same here).
Also note the fourth parameter to the regexp_replace() function ('i'), to make it case insensitive.

Alternatives

As per request in the comment.
At the same time demonstrating how to sort matches first (and NULLS LAST).

SELECT id, title
      ,substring(title FROM '(?i)(^.*)deep.*space.*') AS sub1
      ,length(substring(title FROM '(?i)(^.*)deep.*space.*')) AS pos1

      ,substring(title FROM '(?i)^.*(?=deep.*space.*)') AS sub2
      ,length(substring(title FROM '(?i)^.*(?=deep.*space.*)')) AS pos2

      ,substring(title FROM '(?i)^.*(deep.*space.*)') AS sub3
      ,position((substring(title FROM '(?i)^.*(deep.*space.*)')) IN title) AS p3

      ,regexp_replace(title, 'deep.*space.*', '','i') AS reg4
      ,length(regexp_replace(title, 'deep.*space.*', '','i')) AS pos4
FROM   book
ORDER  BY title ILIKE '%deep%space%' DESC NULLS LAST
         ,length(regexp_replace(title, 'deep.*space.*', '','i'));

You can find documentation for all of the above in the manual here and here.

-> SQLfiddle demonstrating all.

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Jan 14, 2014 at 1:28

Erwin Brandstetter

669k160 gold badges1.2k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

wobmene Over a year ago

Nice! Really simple and good solution. You say there can be many other solutions, could you please show me any? I didn't see this solution and can't imagine others even now :)

wobmene Over a year ago

How can I remove WHERE clause and have the same sorting? I mean I want to get all titles but titles having no match at the end? With this length() I will get very short titles having no match first.

wobmene Over a year ago

I just edited the question because I've found that I don't have ability to create procedural functions at all (thanks heroku).

wobmene Over a year ago

ORDER BY title ILIKE '%deep%space%' DESC, length(regexp_replace(title, 'deep.*space.*', '','i')) looks like a solution

wobmene Over a year ago

Yes, your answer helped a lot. Finally I'm taking the length of string before match and the length of the match, and ordering based on this two values, Thank You very much for inspiration!

|

gilad905 · Accepted Answer · 2020-01-03 13:17:16Z

Another way to do this would be to first get the literal match for the pattern, then find the position of the literal match:

strpos(input, (regexp_match(input, pattern, 'i'))[1]);

Or in this case:

SELECT   id, title
FROM     book
ORDER BY strpos(book.title, (regexp_match(book.title, '.*deep.*space.*', 'i'))[1]);

However, there are few caveats:

this is not very efficient as it will scan the input string twice.
this will ignore lookaround (lookbehind, lookahead) constraints, since the literal match can appear multiple times, before the pattern match. e.g: for the input 'aba' and pattern '(?<=b)a', strpos will return 1 (for the 1st 'a') although the actual position should be 3 (for the 2nd 'a').

BTW, you should probably use a greedy quantifier and narrow your character class as much as you can instead of .* to increase performance (e.g 'deep [\w\s]*? space')

Collectives™ on Stack Overflow

How to get position of regexp match in string in PostgreSQL?

2 Answers 2

Alternatives

7 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Alternatives

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related