Postgres regexp_replace: inability to replace source text with first captured group

Question

Using PostgreSQL, I am unable to design the correct regex pattern to achieve the desired output of an SQL statement that uses regexp_replace.

My source text consists of several scattered blocks of text of the form 'PU*' followed by a date string in the form 'YYYY-MM', for example 'PU*2020-11'. These blocks are surrounded by unpredictable, arbitrary text (including other instances of 'PU*' followed by the same date format, such as 'PU*2017-07'), white space, and line feeds.

My desire is to replace the entire source text with the FIRST instance of the 'YYYY-MM' text pattern. In the above example, the desired output would be '2020-11'.

Currently, my search pattern results in the correct replacement text in place of the first capturing group, but unfortunately, all of the text AFTER the first capturing group also inadvertently appears in the output, which is not the desired output.

Specifically:

Version: postgres (PostgreSQL) 13.0

A more complex example of source text:

First line
Exec committee
PU*2020-08
PU*2019-09--cancelled
PU*2017-10

added by Terranze

My pattern so far:

(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*

Current SQL statement:

select regexp_replace('First line\nExec committee; PU*2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n', '(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*', '\2') as _regex;

Current output on https://regex101.com/

2020-08

Current output on psql

                              _regex                               
───────────────────────────────────────────────────────────────────
 2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n
(1 row)

Desired output:

2020-08

Any help appreciated. Thanks--

PandaCheLion · Accepted Answer · 2020-11-23 02:44:05Z

How about this expression:

'^.*?PU\*(\d{4}-\d{2}).*$'
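
A minimal sketch of how this pattern might be used in the statement from the question. It assumes the source text is written as an escape string (E'...') so that \n becomes a real line feed, and that standard_conforming_strings is on (the default in PostgreSQL 13), so the pattern and replacement can stay in plain literals without doubling their backslashes.

```sql
-- Source text as an escape string: each \n here is a real line feed.
-- Pattern and replacement are plain literals, so \*, \d and \1 reach
-- the regex engine unchanged (assumes standard_conforming_strings = on).
select regexp_replace(
         E'First line\nExec committee\nPU*2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n',
         '^.*?PU\*(\d{4}-\d{2}).*$',
         '\1'
       ) as _regex;
-- expected: 2020-08
```

If the text contains no 'PU*' block at all, regexp_replace returns the input unchanged. When only the extracted value is wanted, substring(src from 'PU\*(\d{4}-\d{2})') may be a simpler option, since it returns the first parenthesized match and NULL when nothing matches.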


5 Comments

WDock Over a year ago

Thanks for your quick response! Well, I realized that I made a mistake in the way that I described the problem. My sample complex source text contains line feeds, and I thought that putting '\n's in the source text would simulate actual line feeds, but I realize now that it does not. The regex engine is interpreting those as literal '\n'.
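
The difference described in this comment can be checked with a short query. This is a sketch that assumes standard_conforming_strings is on (the default since PostgreSQL 9.1); under that setting a plain literal keeps the backslash, while an escape string (E'...') turns \n into a line feed.

```sql
-- Assumes standard_conforming_strings = on (the default).
select length('a\nb')  as plain_literal,   -- 4: a, backslash, n, b
       length(E'a\nb') as escape_string;   -- 3: a, line feed, b
```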

WDock Over a year ago

So I updated the source text to be truly multi-line, and your pattern does not work in this case. Do you know another pattern that might work? Thanks--

WDock Over a year ago

Sorry, I take it back; your solution works! I know that ^ marks the beginning of a string and $ marks the end, but I cannot figure out why they are necessary in this case.

PandaCheLion Over a year ago

Great @WDock. Could you then please mark the answer as accepted? :)

PandaCheLion Over a year ago

Regarding your question: ^ and $ anchor the match to the start and end of the string, so the pattern has to consume the whole input. The leading .*? stays as short as possible, which makes it stop at the first 'PU*' block, while the trailing .* is forced to run all the way to the end of the string. Hope this is clear.
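
One way to see the effect of the anchors is to run the same pattern with and without them. The PostgreSQL documentation on regular expression matching rules says a branch takes its greediness from the first quantified atom in it that has a greediness attribute, so a pattern that opens with .*? is non-greedy as a whole, and a trailing .* is then free to match nothing. That appears to be why the unanchored form leaves the tail of the text in place. The sample text below is hypothetical and shortened for illustration.

```sql
-- src is a hypothetical sample; E'...' makes each \n a real line feed.
with t(src) as (
  values (E'x\nPU*2020-08\nPU*2019-09\ntail')
)
select regexp_replace(src, '.*?PU\*(\d{4}-\d{2}).*',   '\1') as unanchored,
       regexp_replace(src, '^.*?PU\*(\d{4}-\d{2}).*$', '\1') as anchored
from t;
-- unanchored: 2020-08 followed by the untouched remainder
--             (line feed, PU*2019-09, line feed, tail)
-- anchored:   2020-08
```

This would also explain the mismatch with regex101, whose engines decide greediness per quantifier and not for the pattern as a whole.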
