2

I am trying to extract the distinct values from a pipe-delimited string. I found the expression below through a Google search, but it does not work in some cases.

Example:

   select regexp_replace('Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker', '([^|]+)(\|\1)+', '\1') from dual;

Expected Output:

Bhal|Bhaloo|Bhaloooo|Baker

I have tried some combinations in the regex but it is not working for me.

Any help would be appreciated.

5
  • updated the question. Thanks Commented Oct 27, 2015 at 14:42
  • In PCRE, it would be \b([^|]+\b)(\|\1\b)+. Commented Oct 27, 2015 at 14:44
  • What does \b refer to here? Commented Oct 27, 2015 at 14:48
  • But it is not working in SQL. Commented Oct 27, 2015 at 15:01
  • Check this article. Commented Oct 27, 2015 at 15:14

5 Answers

1

This one sure is a challenge. First, understand why the original was failing. The first string found, 'Bhal', is also the first part of the later string 'Bhaloo'. The original regex '([^|]+)(\|\1)+' reads as: match a group of one or more characters that are not a pipe, followed by one or more groups consisting of a pipe followed by the string remembered in the first group. So its match included the first 4 characters of the first occurrence of 'Bhaloo', and the regex engine consumed those characters as it processed the string. The same happened for the remaining values.

The key is to include the ending pattern too, which is either the trailing pipe or the end-of-line anchor if the regex engine is at the end of the string. Here I added the ending pattern group (\||$), which reads as 'followed by a pipe or the end of the line'. This ensures that if the remembered string merely matches the beginning of the next value, that value is not consumed by the regex engine. The replace pattern then adds the ending back as \3 so that it appears in the output (it was consumed by checking for it).

SQL> select regexp_replace('ABhal|Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|||||Baker|Baker|Baker',
  2                        '([^|]*)(\|\1)*(\||$)', '\1\3') as unique_values
  3  from dual;

UNIQUE_VALUES
---------------------------------
ABhal|Bhal|Bhaloo|Bhaloooo||Baker

SQL>

EDIT: Slight tweak handles NULLS when in between other values. Not sure how useful this really is. Changed test case. Also changed the regex to match zero or more instead of one or more (asterisk instead of the plus sign).

Caveats:

I took my own advice and tested with unexpected values. Always expect the unexpected! Perhaps these could be factors for you?

This expects the list to already be in order. i.e. if there is another 'Bhal' at the end, it will be treated as a new value.

Nulls are not handled gracefully either. Well, sort of. Changed test case above to illustrate.
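To illustrate the ordering caveat, here is a minimal sketch using a made-up input (not from the question) in which 'Bhal' reappears after 'Baker'. Since the regex only collapses adjacent repeats, the later 'Bhal' should be kept as if it were a new value:

```sql
-- Hypothetical input: 'Bhal' shows up again after 'Baker',
-- so it is not adjacent to the first run of 'Bhal'.
select regexp_replace('Bhal|Bhal|Baker|Bhal',
                      '([^|]*)(\|\1)*(\||$)', '\1\3') as unique_values
from dual;
-- Only the adjacent pair is collapsed, giving: Bhal|Baker|Bhal
```

If the input cannot be guaranteed to be grouped, a regex-only approach like this one is probably not enough on its own.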


2 Comments

The lack of anchoring at the beginning is still a problem. Try adding ABhal| to the start of the string for a harder test case.
I don't believe it's beginning anchoring as much as allowing for zero or more matches (the asterisk) as opposed to one or more matches (the plus sign). I amended my example. Good catch, thanks!
0

I had to add a | at the end of the string to make it work, so it's not the most elegant solution, but I believe it works:

select rtrim(regexp_replace('Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Baker|Baker|Baker' || '|'
                    , '([^|]+\|)(\1)+', '\1'), '|') from dual

0

I think the problem is that it is only looking for:

(string of non-pipe characters)(a pipe character)(the string found at \1)

which would be a partial match in the case of abc|abcd.
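The partial match is easy to reproduce with just the two values mentioned above. This is a minimal sketch using the regex from the question:

```sql
-- 'abc' is a prefix of 'abcd', so the back-reference \1 matches
-- the first three characters of the second value.
select regexp_replace('abc|abcd', '([^|]+)(\|\1)+', '\1') as result
from dual;
-- The match is 'abc|abc', which leaves the trailing 'd' behind: abcd
```

The value 'abc' is lost and only 'abcd' remains, even though the two values are different.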

This almost works:

select regexp_replace(
         'Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker'
       , '([^|]+)(\|)(\1\|)+'
       , '\1|' )
from   dual;

although it doesn't catch the final Baker as it's not followed by a pipe. If you don't mind concatenating one more pipe character onto the end of your source string and cleaning up the output you're there.
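A minimal sketch of that workaround, assuming the source string does not already end with a pipe: concatenate one pipe before the replace and strip it again with rtrim afterwards.

```sql
-- Append a trailing pipe so the last value is also followed by a delimiter,
-- then remove the extra pipe from the result with rtrim.
select rtrim(
         regexp_replace(
           'Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker' || '|'
         , '([^|]+)(\|)(\1\|)+'
         , '\1|' )
       , '|' ) as unique_values
from   dual;
-- For this input the result should be: Bhal|Bhaloo|Bhaloooo|Baker
```

Note that rtrim removes every trailing pipe, so empty elements at the end of the list would be dropped too. The pattern is also not anchored at the start of a value, so an input such as 'ABhal|Bhal' would likely still be collapsed incorrectly.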

0

The problem has already been well identified and analyzed by other answers. So I am just adding another possible solution here. At least, for the test case given in the question, this produces expected output.

select regexp_replace('Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|Baker|Baker|Baker', '(.+?)(\|)((\1(\2|$))+)', '\1\5') from dual

Brief explanation: Note that Capture groups are numbered by the opening parenthesis at the start of the group.

   (  )  (  )  (  (  (  )  )  ) 
   1     2     3  4  5   

Here, group 5 is contained in group 4 which in turn is contained in group 3.

Capture group 1 -> (.+?) Match one or more characters. This is non-greedy, so stops when there is a match for the next part of regex.

The expression given in the question [^|]+ works as well.
This effectively matches one of the words in the string.

Capture group 2 -> ( \| ) Match the delimiter, which is a literal '|'

Capture group 3 -> ( (\1 ( \2 | $ ) )+ ) This contains group 4 which in turn contains group 5. This matches a sequence of "one of the words in the string followed by either a delimiter or end of the string"

Capture group 4 -> (\1 (\2 | $) ) The actual word matched in group 1, followed by delimiter (which is group 2) or end of the string

Capture group 5 -> ( \2 | $) Matches the delimiter '|' or end of the string
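To check the group numbering in practice, one option is to pull out individual capture groups with REGEXP_SUBSTR. This is only a sketch: it assumes Oracle 11g or later, where the sixth argument selects a subexpression, and the shortened input and the aliases s, p and group_n are made up for illustration.

```sql
-- The sixth argument of REGEXP_SUBSTR picks which capture group
-- of the first match to return (1 = group 1, 2 = group 2, ...).
select regexp_substr(s, p, 1, 1, null, 1) as group_1,
       regexp_substr(s, p, 1, 1, null, 2) as group_2,
       regexp_substr(s, p, 1, 1, null, 3) as group_3
from (select 'Bhal|Bhal|Bhal|Bhaloo'  as s,
             '(.+?)(\|)((\1(\2|$))+)' as p
      from dual);
```

Groups 4 and 5 sit inside a repeated group, so what they hold after the match depends on how the engine handles repeated captures. It is worth inspecting them the same way before relying on \5 in the replacement.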

2 Comments

Alas, it has troubles where a single value exists at the start of the string and with NULL list elements: 'ABhal|Bhal|Bhal|Bhal|Bhaloo|Bhaloo|Bhaloo|Bhaloooo|Bhaloooo|Bhaloooo|||||Baker|Baker|Baker'. Perhaps it's not fair to beat our brains out over the myriad possible combinations of list elements. The OP never stated what the expected data should be like. Although one should always expect the unexpected. :-/
@Gary_W, I totally agree with you. It is indeed very tricky to cover all these cases with regular expressions alone. That too with Oracle's regex which does not support all the features provided by, say, Perl. So you are right on when you say "not fair to beat our brains over this" :)
0

I combined a few ideas, and now use a function that returns a sorted list of the distinct values in a string. Unlike the other answers, this method doesn't require the list to be sorted already.

This SQL could also be utilized within a sub-select rather than a function.

function UniqueList (cList varchar2, cNewItem varchar2 default '', cDelim varchar2 default ',')
return varchar2
is
  cResult varchar2(4000);
begin
  -- Split the combined list on cDelim, drop duplicates, then re-join in sorted order.
  select distinct listagg(txt, cDelim) within group (order by txt) over ()
    into cResult
    from (
      select distinct *
        from (
          -- One row per non-empty element; LEVEL selects the Nth element.
          select regexp_substr(cList || cDelim || cNewItem, '[^' || cDelim || ']+', 1, level) txt
            from dual
          connect by regexp_substr(cList || cDelim || cNewItem, '[^' || cDelim || ']+', 1, level)
                     is not null
        )
    );
  return cResult;
end;
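For the sub-select variant, a rough sketch could look like the query below. It assumes Oracle 11gR2 or later for LISTAGG and a single-row source; the input string and the aliases s, txt and unique_values are made up for illustration.

```sql
-- Split the string into one row per element, keep the distinct values,
-- then join them back together in sorted order.
select listagg(txt, '|') within group (order by txt) as unique_values
from (
  select distinct regexp_substr(s, '[^|]+', 1, level) as txt
  from (select 'Bhal|Baker|Bhal|Bhaloo|Baker' as s from dual)
  connect by regexp_substr(s, '[^|]+', 1, level) is not null
);
-- For this unsorted input the result should be: Baker|Bhal|Bhaloo
```

Because '[^|]+' only matches non-empty elements, empty list entries are skipped rather than returned as NULLs.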
