I have a column in a Postgres database that stores text as character varying. The text contains a URI that includes a file name and looks like this:

  The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\12345678.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.

I am trying to get the file name 12345678.PDF and the date 20221122 from the text. However, regexp_replace gives me either everything up to and including the file name, or the file name and everything after it. I want only the file name.

1>> Regexp_replace(data, '.+\\', '')

Yields filename and everything after it

 2>> Regexp_replace(data, '\[.*', '')

Yields everything up to and including the filename

If I combine the two patterns with alternation, as below, I get the same result as 1.

Regexp_replace(data, '.+\\|\[', '')

How can I substitute both parts and get only the filename? Or is there a better way to achieve this? I also need the date value, but if I can figure this out I should be able to apply the same approach to extract the date. Thanks for your time.

  • You're running a replace function, so you'll need to capture the part that you want to keep and replace the rest of the string with it. Try something like Regexp_replace(data, '.+\\(.+)`.*', '\1') Commented Nov 23, 2022 at 1:54
  • I tried it but I am getting the full string back. I tried substring(data from '\w*.PDF'), which returns the desired result, but it returns nothing when the extension is not PDF. I could use \w*\.[aA-zZ], but the string contains a domain such as example.vpc.com, which produces an undesired match. I am trying to figure out how to further qualify the substring to cover extensions such as Pdf, pdf, DOC, doc and the like. Commented Nov 23, 2022 at 5:44
  • Maybe REGEXP_MATCHES(col, "`([^`]+)` *\[([^][]+)")? Commented Nov 23, 2022 at 8:37
  • @WiktorStribiżew I tried your suggestion and am getting null results. Commented Nov 23, 2022 at 18:43
  • @Alsheik It works here. Commented Nov 23, 2022 at 19:03
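
As a rough sketch of the capture-and-replace idea from the first comment, applied to the string shown in the question (which has no backticks): the first column matches the whole string and keeps only a captured group, and the second reuses the alternation attempt with the 'g' flag, since regexp_replace replaces only the first match without it. The CTE t and its single sample row are hypothetical stand-ins for the real table, and the literals assume standard_conforming_strings is on (the default in current PostgreSQL versions).

```sql
-- Hypothetical sample row; "data" stands in for the real column.
WITH t(data) AS (
  VALUES ('The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\12345678.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.')
)
SELECT
  -- Match the entire string, capture the last path segment, keep only that group.
  regexp_replace(data, '.*\\([^[:space:]\\]+)\s+\[.*', '\1') AS via_capture,
  -- Remove both unwanted parts; 'g' makes the second alternative apply as well.
  regexp_replace(data, '.+\\|\s*\[.*', '', 'g')              AS via_global_flag
FROM t;
```

Both columns should come back as 12345678.PDF for this sample row. The capture version leaves the string unchanged when the pattern does not match, which is worth keeping in mind for rows with a different layout.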

1 Answer

You can use

SELECT REGEXP_MATCHES(
  'The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\2779780.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.',
  '([^[:space:]\\/]+)\s+\[([^][]+)') AS Result;

See the DB fiddle. The query returns a text array holding the two captured groups: the file name and the hash.

Details:

  • ([^[:space:]\\/]+) - Group 1: one or more chars other than \, / and whitespace
  • \s+ - one or more whitespaces
  • \[ - a [ char
  • ([^][]+) - Group 2: one or more chars other than [ and ].
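
If plain text columns are preferred over an array, one possible variation is to subscript the result of regexp_match, which returns only the first match as a single text[] value. This is a sketch that assumes PostgreSQL 10 or later (where regexp_match is available); the CTE t and the column aliases are illustrative names, not part of the original query.

```sql
-- Hypothetical sample row; "data" stands in for the real column.
WITH t(data) AS (
  VALUES ('The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\2779780.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.')
)
SELECT
  -- Element 1 of the match array is Group 1 (the file name).
  (regexp_match(data, '([^[:space:]\\/]+)\s+\[([^][]+)'))[1] AS file_name,
  -- Element 2 is Group 2 (the text between the square brackets).
  (regexp_match(data, '([^[:space:]\\/]+)\s+\[([^][]+)'))[2] AS sha1
FROM t;
```

Rows where the pattern does not match yield NULL in both columns instead of being dropped, which differs from how the set-returning REGEXP_MATCHES behaves in the select list.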

3 Comments

Thanks @WiktorStribiżew, your answer got me what I wanted with a slight change: ([^[:space:]\\/]+)\s+\[ returns the required filename.pdf portion. I decided to go with substring instead of regexp_matches to avoid the curly braces (SQLize). Again, thanks for the help. Now on to figuring out how to extract the date folder before \filename.pdf; I will update the thread once I find the answer.
To find the date folder, '`\([0-9]{8})\`' gives me the desired result, since the folder name is in YYYYMMDD format (SQLize). Hope this helps someone.
I was unable to get the double backslashes to show in my previous comment; refer to the SQLize link for the answer.
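
Putting the comments above together, a minimal sketch of the substring approach might look like the following. It assumes the date pattern was meant to have a doubled backslash on each side of the digit group, as the last comment indicates; the CTE t, the sample row, and the column aliases are hypothetical.

```sql
-- Hypothetical sample row; "data" stands in for the real column.
WITH t(data) AS (
  VALUES ('The file is a file of \\88-77-99-666.abc.example.com\Folder1\Folder2\Folder3\Folder4\20221122\12345678.PDF [9bc8rer55c655f4cb5df763c61862d3fdde9557b0] is the sha1 of the file.')
)
SELECT
  -- substring(... from pattern) returns the first parenthesized group as text.
  substring(data from '([^[:space:]\\/]+)\s+\[') AS file_name,
  -- Eight digits enclosed by literal backslashes: the YYYYMMDD folder.
  substring(data from '\\([0-9]{8})\\')          AS date_folder
FROM t;
```

Because substring returns text, not an array, the curly braces mentioned above do not appear in the output. The date pattern only checks for eight digits between backslashes, so it does not validate that the value is a real calendar date.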
