Using PostgreSQL, I am unable to design the correct regex pattern to achieve the desired output of an SQL statement that uses regexp_replace.

My source text consists of several scattered blocks of text of the form 'PU*' followed by a date string in the form of 'YYYY-MM'--for example, 'PU*2020-11'. These blocks are surrounded by strings of unpredictable, arbitrary text (including other instances of 'PU*' followed by the above date string format, such as 'PU*2017-07'), white space, and line feeds.

My desire is to replace the entire source text with the FIRST instance of the 'YYYY-MM' text pattern. In the above example, the desired output would be '2020-11'.

Currently, my search pattern results in the correct replacement text in place of the first capturing group, but unfortunately, all of the text AFTER the first capturing group also inadvertently appears in the output, which is not the desired output.

Specifically:

Version: postgres (PostgreSQL) 13.0

A more complex example of source text:

First line
Exec committee
PU*2020-08
PU*2019-09--cancelled
PU*2017-10

added by Terranze

My pattern so far:

(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*

Current SQL statement:

select regexp_replace('First line\nExec committee; PU*2020-08\nPU*2019-09\nPU*2017-10\n\nadded by Terranze\n', '(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*', '\2') as _regex;

Current output on https://regex101.com/

2020-08

Current output on psql

                              _regex                               
───────────────────────────────────────────────────────────────────
 2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n
(1 row)

Desired output:

2020-08

Any help appreciated. Thanks--

1 Answer

How about this expression:

'^.*?PU\*(\d{4}-\d{2}).*$'
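
For reference, here is a minimal sketch of the full statement, assuming the source text contains real line feeds. The E'...' literal is used only so that \n in the sample text becomes an actual line feed; the pattern and the replacement stay in ordinary string literals so their backslashes reach the regex engine unchanged (this assumes the default setting standard_conforming_strings = on).

```sql
-- Sample text from the question, written with real line feeds via E'...'
select regexp_replace(
  E'First line\nExec committee\nPU*2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n',
  '^.*?PU\*(\d{4}-\d{2}).*$',  -- non-greedy prefix, captured date, rest of the string
  '\1'                         -- keep only the captured YYYY-MM
) as _regex;
-- expected: 2020-08
```

In PostgreSQL's default matching mode, . also matches a line feed, which is why a single .* can span several lines here.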

5 Comments

Thanks for your quick response! Well, I realized that I made a mistake in the way that I described the problem. My sample complex source text contains line feeds, and I thought that putting '\n's in the source text would simulate actual line feeds, but I realize now that it does not. The regex engine is interpreting those as literal '\n'.
So I updated the source text to be truly multi-line, and your pattern does not work in this case. Do you know another pattern that might work? Thanks--
Sorry, I take it back; your solution works! I know that ^ marks the beginning of a string and $ marks the end, but I cannot figure out why they are necessary in this case.
Great, @WDock. Could you then please mark the answer as accepted? :)
Regarding your question: ^ and $ anchor the match to the start and the end of the string. The $ forces the trailing .* to consume everything up to the end of the string, while the non-greedy .*? at the start stops at the first PU*. Hope this is clear.
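
The effect of the $ anchor can be seen by running the pattern with and without it. This is a small sketch on a made-up three-line text, not the original data. As far as I understand the PostgreSQL documentation, an expression that starts with a non-greedy quantifier is non-greedy as a whole, so it prefers the shortest overall match unless an anchor forces it further.

```sql
-- Without $: the leading .*? makes the whole expression non-greedy,
-- so the trailing .* matches nothing and the tail of the text is left in place.
select regexp_replace(E'x\nPU*2020-08\nPU*2019-09', '^.*?PU\*(\d{4}-\d{2}).*', '\1');

-- With $: the match has to reach the end of the string, so the tail is consumed.
select regexp_replace(E'x\nPU*2020-08\nPU*2019-09', '^.*?PU\*(\d{4}-\d{2}).*$', '\1');
-- expected: 2020-08
```

The first statement reproduces the symptom from the question: the date is followed by the unmatched remainder of the text.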
