Distinct values from regexp_replace in Oracle not working

Question

I am trying to extract distinct values from pipe-delimited text. When I searched on Google I found the expression below, but it does not work in some cases.

For example:

   select regexp_replace('Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker', '([^|]+)(\|\1)+', '\1') from dual;

Expected Output:

Bhal|Bhaloo|Bhaloooo|Baker

I have tried some variations of the regex, but none of them works for me.

Any help would be appreciated.

updated the question. Thanks

– arunb2w, Commented Oct 27, 2015 at 14:42

In PCRE, it would be \b([^|]+\b)(\|\1\b)+.

– Wiktor Stribiżew, Commented Oct 27, 2015 at 14:44

What does \b refer to here?

– arunb2w, Commented Oct 27, 2015 at 14:48

But it is not working in Oracle SQL.

– arunb2w, Commented Oct 27, 2015 at 15:01

Check this article.

– Wiktor Stribiżew, Commented Oct 27, 2015 at 15:14

Gary_W · Accepted Answer · 2015-10-27 21:05:46Z

1

This one sure is a challenge. First, understand why the original was failing. The first string found, 'Bhal', is also the first part of the next distinct string, 'Bhaloo'. The original regex '([^|]+)(\|\1)+' reads as: match a group of one or more characters that are not a pipe, followed by one or more groups consisting of a pipe followed by the string remembered in the first group. So the matched part of the string included the first 4 characters of the first occurrence of 'Bhaloo', and the regex engine consumed those characters as it processed the string. The same happened for the remaining patterns found.

The key is to include the ending pattern too, which is either the ending pipe or, if the regex engine is at the end of the string, the end of the line. Here I added the ending pattern group (\||$), which reads as 'followed by a pipe or the end of the line'. This ensures that if the remembered string merely matches the beginning of the next element, that element is not consumed by the regex engine. The replace pattern then adds the ending group back as \3 so that it is printed in the output (it was consumed by checking for it).

SQL> select regexp_replace('ABhal|Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|||||Baker|Baker|Baker',
  2                        '([^|]*)(\|\1)*(\||$)', '\1\3') as unique_values
  3  from dual;

UNIQUE_VALUES
---------------------------------
ABhal|Bhal|Bhaloo|Bhaloooo||Baker

SQL>

EDIT: A slight tweak handles NULLs that fall between other values; I am not sure how useful this really is. I changed the test case, and also changed the regex to match zero or more instead of one or more (asterisk instead of the plus sign).

Caveats:

I took my own advice and tested with unexpected values. Always expect the unexpected! These may or may not matter for your data:

This expects duplicates to be adjacent, i.e. the list must already be in order. If another 'Bhal' appears at the end, it is treated as a new value.

NULL (empty) elements are only partly handled: a run of consecutive pipes is collapsed, but one empty element is left behind. The test case above was changed to illustrate this.
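
The ordering caveat can be checked on a shorter list. This is a minimal sketch, assuming the same pattern as above and a made-up three-element input in which the two 'Bhal' entries are not adjacent:

```sql
-- Hypothetical input: the duplicate 'Bhal' values are separated by 'Baker'.
-- The backreference \1 is only compared with the elements that follow
-- immediately, so the second 'Bhal' is kept as if it were a new value.
select regexp_replace('Bhal|Baker|Bhal',
                      '([^|]*)(\|\1)*(\||$)', '\1\3') as unique_values
from dual;

-- UNIQUE_VALUES
-- ---------------
-- Bhal|Baker|Bhal
```

If the list cannot be sorted before it reaches the regex, an approach based on regexp_replace alone is probably not enough.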

edited Oct 27, 2015 at 21:05

answered Oct 27, 2015 at 20:03


2 Comments

user2404501 Over a year ago

The lack of anchoring at the beginning is still a problem. Try adding ABhal| to the start of the string for a harder test case.

Gary_W Over a year ago

I don't believe it's beginning anchoring as much as allowing for zero or more matches (the asterisk) as opposed to one or more matches (the plus sign). I amended my example. Good catch, thanks!

Andres · Accepted Answer · 2015-10-27 18:45:30Z

0

I had to add a | at the end of the string to make it work, so it's not the most elegant solution, but I believe it works:

select rtrim(regexp_replace('Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Baker|Baker|Baker' || '|',
                            '([^|]+\|)(\1)+', '\1'),
             '|')
from dual;

answered Oct 27, 2015 at 18:45


William Robertson · Accepted Answer · 2015-10-27 18:57:14Z

0

I think the problem is that it is only looking for:

(string of non-pipe characters)(a pipe character)(the string found at \1)

which would be a partial match in the case of abc|abcd.

This almost works:

select regexp_replace(
         'Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker'
       , '([^|]+)(\|)(\1\|)+'
       , '\1|' )
from   dual;

although it doesn't catch the final Baker, as it is not followed by a pipe. If you don't mind concatenating one more pipe character onto the end of your source string and cleaning up the output, you're there.
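
Both points can be tried on small inputs. The first query below is a sketch of the partial match, using the pattern from the question on the abc|abcd case. The second applies the trailing-pipe workaround to a shortened, made-up version of the list, with rtrim as one possible way to clean up the output:

```sql
-- Partial match: 'abc|abc' is matched and replaced by 'abc',
-- leaving the stray 'd' behind, so the result is 'abcd'.
select regexp_replace('abc|abcd', '([^|]+)(\|\1)+', '\1') as partial_match
from dual;

-- Workaround: append a pipe so that every element, including the last,
-- is followed by one, then strip the extra pipe from the result.
select rtrim(regexp_replace('Bhal|Bhal|Bhaloo|Bhaloo|Baker|Baker' || '|',
                            '([^|]+)(\|)(\1\|)+', '\1|'),
             '|') as unique_values
from dual;
-- Result: Bhal|Bhaloo|Baker
```

Note that rtrim removes every trailing pipe, so empty elements at the end of the list would be lost as well.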

answered Oct 27, 2015 at 18:57


ramana_k · Accepted Answer · 2015-10-28 02:01:17Z

0

The problem has already been well identified and analyzed by other answers. So I am just adding another possible solution here. At least, for the test case given in the question, this produces expected output.

select regexp_replace('Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker', '(.+?)(\|)((\1(\2|$))+)', '\1\5') from dual

Brief explanation: Note that Capture groups are numbered by the opening parenthesis at the start of the group.

   (  )  (  )  (  (  (  )  )  ) 
   1     2     3  4  5

Here, group 5 is contained in group 4 which in turn is contained in group 3.

Capture group 1 -> (.+?) Match one or more characters. This is non-greedy, so it stops as soon as the next part of the regex can match. The expression given in the question, [^|]+, works as well. This effectively matches one of the words in the string.

Capture group 2 -> ( \| ) Match the delimiter, which is a literal '|'

Capture group 3 -> ( (\1 ( \2 | $ ) )+ ) This contains group 4 which in turn contains group 5. This matches a sequence of "one of the words in the string followed by either a delimiter or end of the string"

Capture group 4 -> (\1 (\2 | $) ) The actual word matched in group 1, followed by delimiter (which is group 2) or end of the string

Capture group 5 -> ( \2 | $) Matches the delimiter '|' or end of the string

edited Oct 28, 2015 at 2:01

answered Oct 28, 2015 at 1:13


2 Comments

Gary_W Over a year ago

Alas, it has trouble where a single value exists at the start of the string, and with NULL list elements: 'ABhal|Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|||||Baker|Baker|Baker'. Perhaps it's not fair to beat our brains out over the myriad possible combinations of list elements. The OP never stated what the expected data should be like. Although one should always expect the unexpected. :-/

ramana_k Over a year ago

@Gary_W, I totally agree with you. It is indeed very tricky to cover all these cases with regular expressions alone. That too with Oracle's regex which does not support all the features provided by, say, Perl. So you are right on when you say "not fair to beat our brains over this" :)
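
The first problem mentioned in the comments above can be reproduced with a short input. This is only a sketch; the three-element list is a made-up example chosen to keep the output easy to follow:

```sql
-- No match is possible at position 1, because 'ABhal' is never repeated.
-- The engine then matches 'Bhal|Bhal|Bhal' starting at position 2,
-- i.e. inside 'ABhal', and replaces it with a single 'Bhal',
-- so the distinct value 'Bhal' disappears from the result.
select regexp_replace('ABhal|Bhal|Bhal',
                      '(.+?)(\|)((\1(\2|$))+)', '\1\5') as unique_values
from dual;
-- Result: ABhal   (the desired result would be ABhal|Bhal)
```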

Mark Van Alstyne · Accepted Answer · 2017-12-08 18:21:12Z

I combined a few ideas, and now use a function that returns a distinct sorted list of unique values from a string. This method doesn't require the list to be already sorted as the other answers do.

This SQL could also be utilized within a sub-select rather than a function.

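function UniqueList (cList varchar2, cNewItem varchar2 default '', cDelim varchar2 default ',')
return varchar2
is
  cResult varchar2(4000);
begin
  -- Split cList (plus the optional cNewItem) into one row per item,
  -- de-duplicate the rows, then re-join them in sorted order.
  -- The analytic listagg returns the same list on every row; the outer
  -- distinct collapses those rows to the single row needed by INTO.
  select distinct listagg(txt, cDelim) within group (order by txt) over ()
    into cResult
    from (
      select distinct *
        from (
          select regexp_substr(cList||cDelim||cNewItem, '[^'||cDelim||']+', 1, level) txt
            from dual
         connect by regexp_substr(cList||cDelim||cNewItem, '[^'||cDelim||']+', 1, level)
                    is not null
        )
    );
  return cResult;
end;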
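
As noted, the same query can be used directly rather than through a function. Here is a minimal sketch of that form, assuming a pipe delimiter, a made-up input string, and Oracle 11g Release 2 or later for listagg:

```sql
-- Split on '|', keep the distinct items, and re-join them in sorted order.
-- Note that '[^|]+' skips empty elements entirely.
select listagg(txt, '|') within group (order by txt) as unique_values
from (
  select distinct regexp_substr(s, '[^|]+', 1, level) as txt
  from (select 'Bhal|Baker|Bhal|Bhaloo' as s from dual)
  connect by regexp_substr(s, '[^|]+', 1, level) is not null
);
-- Result: Baker|Bhal|Bhaloo
```

Unlike the regexp_replace answers, this returns the values in sorted order rather than in their original order.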
