Get all instances of a substring in a string in PostgreSQL?

Question

I've got a table with entries that are similar to "something XXX010101 somethingelse XXX010102".

I'd like to extract the XXX01... parts from them - possibly using some separator if needed - right now I'm able to easily take out the first one using:

select distinct substring(content from 'XXX[0-9]+'), content from data where content ~ 'XXX[0-9]+'

The one idea I have is to create some monster regex to replace everything that's not an XXX substring... but to be true I hoped (and failed to find) for a simpler solution (like a g flag to substring).

When looking around I found that 8.3 introduced a regex_matches function that seems to be what I'd need - are there any alternatives in 8.2? Or an easy way to get it in 8.2?

How would you go about solving this? Or is an upgrade my best bet? ;) (it's a production system so the downtime and some risk of problems after migration is a source of hesitation ).

Thanks in advance.

-- expected output added --

for

"something XXX010101 something else XXX010102"

I'd like to get:

XXX010101
XXX010102

OR (less preferable)

XXX010101,XXX010102

I was not aware - thanks. We never had any issues so support was not really an issue and I plan to update once we get a downtime "either way" for some other reason (datacenter migration etc (it's planned whithin a couple months)) — RnR
– RnR, Commented Dec 2, 2011 at 13:29
De-Supported means that there will be no more bug fixes (security or otherwise) beginning next year. See here: wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy — user330315
– user330315, Commented Dec 2, 2011 at 13:32
Release policy has been moved to official site: postgresql.org/support/versioning — Erwin Brandstetter
– Erwin Brandstetter, Commented Apr 12, 2012 at 4:45

Erwin Brandstetter · Accepted Answer · 2011-12-02 21:59:13Z

I am somewhat hesitant to even post my answer. You really must upgrade. Version 8.2 reaches end of live right now. Follow the link @a_horse_with_no_name posted.

However, the problem got my attention. The following demo should work with PostgreSQL 8.2:

SELECT -- content,
         rtrim(
         regexp_replace(
         regexp_replace(
         replace(
         substring(content
        ,E'(XXX\\d+|XXX\\d+.*XXX\\d+)')  -- kill leading / trailing noise
        ,',',' ')                        -- remove all ","
        ,E'(XXX\\d+)', E'\\1,', 'g')     -- terminate X-strings with ","
        -- now we can use non-greedy terminated with ","
        ,E'(XXX\\d+?,)*?.*?(XXX\\d+?,)', E'\\1\\2', 'g')
        ,',') AS result
FROM    (VALUES
  ('no match')
 ,('XXX010101')
 ,('XXX010102 beginn')
 ,('end XXX010103')
 ,('middle XXX010104 match')
 ,('match middle XXX010105 and end XXX010106')
 ,('XXX010107, match beginn XXX010108 & middle')
 ,('XXX010109 begin and end XXX010110')
 ,('XXX01011 begin XXX010112 middle and end XXX010113')
 ,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
 ,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end')
) data(content)

Result:

                     result
--------------------------------------------------
             -- first line is NULL
 XXX010101
 XXX010102
 XXX010103
 XXX010104
 XXX010105,XXX010106
 XXX010107,XXX010108
 XXX010109,XXX010110
 XXX01011,XXX010112,XXX010113
 XXX01014,XXX010115,XXX010116
 XXX01017,XXX010118,XXX010119,XXX010120,XXX010121

Some explanation:

no regex_matches() in version 8.2 as OP already stated
but there is regexp_replace() which can use a g flag (replace globally)
we cannot mix greedy and non-greedy quantifiers in the same regular expression
so I terminate the wanted strings with a ,, after removing all other occurances of , Could be any character not part of the wanted strings, but , can serve as separator in the result.
first cut leading and trailing noise
then globally replace everything in between wanted strings.
to make that work use (XXX\\d+?,)*? to capture any number of leading occurances of wanted string.
final rtrim() removes trailing ,
In PsotgreSQL 8.3+ you could use regexp_split_to_table() to split the wanted strings into single rows. In 8.2 you'll have to come up with something of your own. I would write a plgpsql function ...

This makes heavy use of a features of PostgreSQL's POSIX Regular Expressions (link to version 8.2!)

xQbert · Accepted Answer · 2011-12-02 14:18:04Z

1

How about something like this (assuming the values you're looking for are contained in a separate table)... I'll make no claims as to good performance though...

Select A.text, B.Text2, B.Val 
FROM A 
INNER JOIN B ON B.Text2 LIKE ('%' || A.Text || '%')

let table A contain all possible combinations of XXX010101... that your looking for

text
XXX010101
XXX010102
XXX010103

let table B contain all the search texts you want

text 2                                              val
something XXX010101 something else XXX010102        1
yet another XXX010102 and this XXX010103            2
XXX010105                                           3
XXX010103                                           4

Results:

text            text2                                           VAL
XXX010101   something XXX010101 something else XXX010102    1
XXX010102   something XXX010101 something else XXX010102    1
XXX010102   yet another XXX010102 and this XXX010103        2
XXX010103   yet another XXX010102 and this XXX010103        2
XXX010103   XXX010103                                       4

--------WRONG but misunderstood the question...

What's wrong with the replace function? http://www.postgresql.org/docs/8.2/interactive/functions-string.html

replace( 'abcdefabcdef', 'cd', '')

so the characters are set to empty string.

edited Dec 2, 2011 at 14:18

answered Dec 2, 2011 at 12:38

xQbert

35.5k2 gold badges46 silver badges67 bronze badges

4 Comments

RnR Over a year ago

To be true it's mostly that the regex would be hard to read and "logically reversed" and I'd need to parse the output either way on the client side - and in that case I can do this on the client side altogether and it'll be much more readable too I think? (while ($content =~ m/XXX[0-9]{6}/g) ... ).

xQbert Over a year ago

I guess I don't understand the question then. Can you update the question to include Sample input/output that you want?

xQbert Over a year ago

Yep, I totally miss understood the question :D

xQbert Over a year ago

Is there by chance a table which contains all possible values of the XXX010101, XXX010102...

Evan Carroll · Accepted Answer · 2017-03-10 06:33:34Z

1

The fastest method is to use plperlu which works with 8.2.

CREATE LANGUAGE plperl

CREATE FUNCTION get_things(inputStr text)
RETURNS SETOF text
AS $BODY$
  return \@{[ $_[0] =~ m/(XXX\d{6})/g ]} 
$BODY$
LANGUAGE plperl
IMMUTABLE;

SELECT get_things(x)
FROM ( VALUES
  ('XXX010101 somethingelse XXX010102')
) AS t(x);

 get_things 
------------
 XXX010101
 XXX010102
(2 rows)

It also works with new versions of PostgreSQL.

edited Mar 10, 2017 at 6:33

answered Mar 10, 2017 at 6:27

Evan Carroll

1

Comments

Vérace · Accepted Answer · 2024-05-01 18:38:20Z

Late to the party, but (code below is available here):

CREATE TABLE tab
(
  s_id INT  NOT NULL GENERATED ALWAYS AS IDENTITY,
  str  TEXT NOT NULL
);

Populate:

INSERT INTO tab (str) VALUES
 ('no match')
 ,('XXX010101')
 ,('XXX010101 XXX020202')
 ,('XXX010102 beginn')
 ,('end XXX010101')
 ,('middle XXX010101 match')
 ,('match middle XXX010102 and end XXX010102')
 ,('XXX010107, match beginn XXX010108 & middle')
 ,('XXX010109 begin and end XXX010110')
 ,('XXX01011 begin XXX010112 middle and end XXX010113')
 ,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
 ,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end');

and then we run:

SELECT
  s_id,
  UNNEST(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g')),
  PG_TYPEOF(UNNEST(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g')))  -- not required
                                                             -- for testing
  str
FROM
  tab
ORDER BY s_id;

Result:

s_id    unnest  str
2   XXX010101   text
3   XXX010101   text
3   XXX020202   text
4   XXX010102   text
5   XXX010101   text
6   XXX010101   text
7   XXX010102   text
7   XXX010102   text
...
... snipped for brevity
...
12  XXX010118   text
12  XXX010119   text
12  XXX010120   text
12  XXX010121   text

You can also use (probably more performant - see fiddle):

  (REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g'))[1],

If you have a .csv string, you can turn it into the result above using REGEXP_SPLIT_TO_TABLE().

SELECT
  REGEXP_SPLIT_TO_TABLE('XXX01017,XXX010118,XXX010119,XXX010120,XXX010121', ',');

Result:

regexp_split_to_table
XXX01017
XXX010118
XXX010119
XXX010120
XXX010121

Collectives™ on Stack Overflow

Get all instances of a substring in a string in PostgreSQL?

4 Answers 4

Some explanation:

Comments

4 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Some explanation:

Comments

4 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related