
I have the following table:

B_ID    I_ID    R_ID
W00001  1234    1235,1237
B00001  1235    1236,1235
T00001  1236    1235,1235,1235
X00001  1237    1234,1236,1238
M00001  1238    1238

I need the output below, using SQL:

B_ID    I_ID    R_ID
W00001  1234    B00001|X00001
B00001  1235    T00001|B00001
T00001  1236    B00001
X00001  1237    W00001|T00001|M00001
M00001  1238    M00001
  1. Each value in R_ID should be matched against I_ID and the corresponding B_ID picked.
  2. The commas in the R_ID column should be replaced with a '|' delimiter.

Example: in the 1st row, R_ID has the values 1235,1237. Both 1235 and 1237 are present in I_ID, so their corresponding B_IDs are picked, i.e. B00001 and X00001, and the expected output is B00001|X00001.
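For reference, the requested transformation can be sketched procedurally. This is an illustration only (Python, using the sample data from the question), not one of the SQL solutions below:

```python
# Illustration of the requested lookup-and-rejoin, using the question's sample table.
rows = [
    ("W00001", 1234, "1235,1237"),
    ("B00001", 1235, "1236,1235"),
    ("T00001", 1236, "1235,1235,1235"),
    ("X00001", 1237, "1234,1236,1238"),
    ("M00001", 1238, "1238"),
]
by_i_id = {i_id: b_id for b_id, i_id, _ in rows}

def translate(r_id):
    # Map each comma-separated I_ID to its B_ID, dropping duplicates
    # while keeping first-seen order, then join with '|'.
    seen = []
    for tok in r_id.split(","):
        b = by_i_id[int(tok)]
        if b not in seen:
            seen.append(b)
    return "|".join(seen)

print(translate("1235,1237"))        # B00001|X00001
print(translate("1235,1235,1235"))   # B00001
```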

  • You shouldn't be storing comma-separated values in the first place. Commented Jul 17, 2016 at 20:20

3 Answers


This avoids duplicates and does not rely on any magic numbers:

Option 1: Hierarchical Query

Oracle Setup:

CREATE TABLE test_data ( b_id, i_id, r_id ) as
select 'W00001', 1234, '1235,1237'      from dual union all
select 'B00001', 1235, '1236,1235'      from dual union all
select 'T00001', 1236, '1235,1235,1235' from dual union all
select 'X00001', 1237, '1234,1236,1238' from dual union all
select 'M00001', 1238, '1238'           from dual;

Query:

SELECT b_id,
       i_id,
       ( SELECT LISTAGG( t.b_id, '|' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   TABLE( CAST( MULTISET(
                  SELECT DISTINCT
                         TO_NUMBER( REGEXP_SUBSTR( d.r_id, '\d+', 1, LEVEL ) )
                  FROM   DUAL
                  CONNECT BY LEVEL <= REGEXP_COUNT( d.r_id, '\d+' )
                ) AS SYS.ODCINUMBERLIST ) ) v
                INNER JOIN test_data t
                ON (v.COLUMN_VALUE = t.i_id) ) AS r_id
FROM   test_data d;

Explanation

The inner correlated select:

SELECT DISTINCT
       TO_NUMBER( REGEXP_SUBSTR( d.r_id, '\d+', 1, LEVEL ) )
FROM   DUAL
CONNECT BY LEVEL <= REGEXP_COUNT( d.r_id, '\d+' )

Takes the r_id for a single row and separates it into one row per comma-delimited value; the DISTINCT keyword means only unique values are output.
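The REGEXP_SUBSTR / CONNECT BY LEVEL pair behaves like extracting every match of \d+ in order, then DISTINCT removes repeats. A rough Python analogue, purely for illustration:

```python
import re

def split_numbers(r_id):
    # Analogue of: REGEXP_SUBSTR(r_id, '\d+', 1, LEVEL)
    # for LEVEL = 1 .. REGEXP_COUNT(r_id, '\d+')
    return [int(m) for m in re.findall(r"\d+", r_id)]

print(split_numbers("1234,1236,1238"))               # [1234, 1236, 1238]
print(sorted(set(split_numbers("1235,1235,1235"))))  # the DISTINCT step -> [1235]
```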

This is converted into a table collection expression using TABLE( CAST( MULTISET( ... ) AS collection_type ) ) so that it can be joined to another table.

This is then self-joined back to test_data to convert from displaying i_ids to b_ids and LISTAGG() is used to re-aggregate the multiple rows back to a single row.

Output:

B_ID         I_ID R_ID
------ ---------- --------------------
W00001       1234 B00001|X00001
B00001       1235 T00001|B00001
T00001       1236 B00001
X00001       1237 W00001|T00001|M00001
M00001       1238 M00001

Option 2: Using PL/SQL

Oracle Setup:

CREATE OR REPLACE TYPE numberlist IS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION split_Number_List(
  i_str    IN  VARCHAR2,
  i_delim  IN  VARCHAR2 DEFAULT ','
) RETURN numberlist DETERMINISTIC
AS
  p_result       numberlist := numberlist();
  p_start        NUMBER(5) := 1;
  p_end          NUMBER(5);
  c_len CONSTANT NUMBER(5) := LENGTH( i_str );
  c_ld  CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
  IF c_len > 0 THEN
    p_end := INSTR( i_str, i_delim, p_start );
    WHILE p_end > 0 LOOP
      p_result.EXTEND;
      p_result( p_result.COUNT ) := TO_NUMBER( SUBSTR( i_str, p_start, p_end - p_start ) );
      p_start := p_end + c_ld;
      p_end := INSTR( i_str, i_delim, p_start );
    END LOOP;
    IF p_start <= c_len + 1 THEN
      p_result.EXTEND;
      p_result( p_result.COUNT ) := TO_NUMBER( SUBSTR( i_str, p_start, c_len - p_start + 1 ) );
    END IF;
  END IF;
  RETURN p_result;
END;
/
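The INSTR/SUBSTR walk that the function performs can be traced like this (a Python analogue for illustration, keeping the 1-based positions of the PL/SQL original; not the function itself):

```python
def split_number_list(s, delim=","):
    # Mirrors the PL/SQL loop: find each delimiter (INSTR), cut the token
    # out (SUBSTR), then pick up the trailing token after the loop.
    result = []
    if not s:                                 # c_len check: null/empty input
        return result
    start = 1                                 # PL/SQL string positions are 1-based
    end = s.find(delim, start - 1) + 1        # INSTR returns 0 when not found
    while end > 0:
        result.append(int(s[start - 1:end - 1]))
        start = end + len(delim)
        end = s.find(delim, start - 1) + 1
    if start <= len(s) + 1:                   # trailing token after the last delimiter
        result.append(int(s[start - 1:]))
    return result

print(split_number_list("1234,1236,1238"))    # [1234, 1236, 1238]
```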

Query:

SELECT b_id,
       i_id,
       ( SELECT LISTAGG( t.b_id, '|' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   TABLE( SET( split_Number_List( d.r_id ) ) ) v
                INNER JOIN test_data t
                ON (v.COLUMN_VALUE = t.i_id) ) AS r_id
FROM   test_data d;

(Same output as above)

Option 3:

SELECT b_id,
       i_id,
       ( SELECT LISTAGG( t.b_id, '|' ) WITHIN GROUP ( ORDER BY ROWNUM )
         FROM   test_data t
         WHERE  ',' || d.r_id || ',' LIKE '%,' || t.i_id || ',%'  ) AS r_id
FROM   test_data d;

You can improve performance of this option by using a function-based index on ',' || r_id || ',' and on '%,' || i_id || ',%'.

(Same output as above)
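The reason both sides are wrapped in commas is to prevent partial matches (e.g. 123 matching inside 1234). A quick Python illustration of the same check (illustration only, not part of the answer's SQL):

```python
def contains_id(r_id, i_id):
    # Same idea as:  ',' || r_id || ',' LIKE '%,' || i_id || ',%'
    # Wrapping both sides in commas forces whole-token matches.
    return f",{i_id}," in f",{r_id},"

print(contains_id("1234,1236,1238", 1236))  # True
print(contains_id("1234,1236,1238", 123))   # False: 123 is only a prefix of 1234
```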


Comments

Using connect by level this way will create numerous duplicate rows. You eliminate them with select distinct, but this will result in very poor performance if you had, say, 100000 rows in the input. You need to add a prior clause to the connect by condition.
@mathguy Adding a PRIOR clause to the connect by assumes that the values in the list are ordered. This makes no assumptions as to the order of the values (in fact you can see from the sample data that it is not ordered). You could also solve it using the SET() collection operator (rather than DISTINCT) but then you need to specify a UDT rather than SYS.ODCINUMBERLIST.
OK, I will add the proper way to do this to my answer, since apparently it is not widely known or understood. PRIOR in this type of problem does not require ordering; each row in the base table is ONLY connected to itself. (And then you need one more condition, to break the cycle.)
@mathguy Using CONNECT BY LEVEL in the way I am doing will not create duplicate rows unless they already exist in the data, and there is no need to use SYS_GUID(). What you are talking about is using it with multiple input rows, which is not the case here, as the DUAL table only has a single row, so there is no risk of cross-connecting different rows.
OK, I keep reading your query and realizing you are right - your query did not duplicate rows.

In the solution below, I use a standard technique to split each comma-separated string into components (tokens) in the factored subquery prep. Then I join back to the original table to replace each token (which is an i_id) with the corresponding b_id, and put the tokens back together into pipe-separated strings with listagg().

Note: This solution assumes that each r_id has fewer than 100 tokens (see the "magic number" 100 in the definition of idx). If it is known that each r_id will have no more than 9 tokens, then 100 can be changed to 10 (resulting in faster processing). If NO upper bound is known beforehand, you can change 100 to some ridiculously large number; 4000 will do if r_id is anything but a CLOB, as VARCHAR2 and such have a limit of 4000 characters.

Thanks to MT0 for reminding me to add this note.

with test_data ( b_id, i_id, r_id ) as (
       select 'W00001', 1234, '1235,1237'      from dual union all
       select 'B00001', 1235, '1236,1235'      from dual union all
       select 'T00001', 1236, '1235,1235,1235' from dual union all
       select 'X00001', 1237, '1234,1236,1238' from dual union all
       select 'M00001', 1238, '1238'           from dual
     ),
     idx ( n ) as (
       select level from dual connect by level < 100
     ),
     prep ( b_id, i_id, n, token ) as (
       select t.b_id, t.i_id, i.n,
                regexp_substr(t.r_id, '([^,]+)', 1, i.n, null, 1)
       from   test_data t join idx i
                          on i.n <= regexp_count(t.r_id, ',') + 1
     )
select p.b_id, p.i_id, 
       listagg(t.b_id, '|') within group (order by p.n) as r_id
from   prep p join test_data t
              on p.token = t.i_id
group by p.b_id, p.i_id
order by p.i_id;



B_ID         I_ID R_ID
------ ---------- ------------------------------
W00001       1234 B00001|X00001
B00001       1235 T00001|B00001
T00001       1236 B00001|B00001|B00001
X00001       1237 W00001|T00001|M00001
M00001       1238 M00001

Added information based on further conversation with MT0.

I edited this "Added information" again based on even more conversation with MT0. Thank you MT0 for keeping me on my toes!

In the solution below I do away with the magic number 100; instead I use a common technique for handling multiple input rows when generating rows with connect by level. I also show a common technique for dealing with duplicates (in the tokens obtained from the comma-separated input strings).
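The duplicate-handling step (ROW_NUMBER() partitioned by token, keeping rn = 1) amounts to "keep the first occurrence of each token, in original order". A Python sketch of that logic, for illustration only:

```python
def first_occurrences(tokens):
    # Analogue of: ROW_NUMBER() OVER (PARTITION BY token ORDER BY n)
    # ... WHERE rn = 1 -- keep each token's first occurrence, preserving order.
    seen = set()
    out = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out

print(first_occurrences(["1235", "1235", "1235"]))  # ['1235']
print(first_occurrences(["1236", "1235"]))          # ['1236', '1235']
```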

Query:

with 
     test_data ( b_id, i_id, r_id ) as (
           select 'W00001', 1234, '1235,1237'      from dual union all
           select 'B00001', 1235, '1236,1235'      from dual union all
           select 'T00001', 1236, '1235,1235,1235' from dual union all
           select 'X00001', 1237, '1234,1236,1238' from dual union all
           select 'M00001', 1238, '1238'           from dual
         ),
     prep ( b_id, i_id, n, token ) as (
       select b_id, i_id, level,
                regexp_substr(r_id, '([^,]+)', 1, level, null, 1)
       from   test_data t
       connect by level <= regexp_count(r_id, ',') + 1
              and prior r_id = r_id             -- to only generate the rows needed
              and prior sys_guid() is not null  -- this is unique, to avoid cycles
     ),
     z ( b_id, i_id, n, token, rn ) as (
       select b_id, i_id, n, token,
              row_number() over (partition by i_id, token order by n)
       from prep
     )
select z.b_id, z.i_id, 
       listagg(t.b_id, '|') within group (order by z.n) as r_id
from   z join test_data t
              on z.token = t.i_id
where  z.rn = 1
group by z.b_id, z.i_id
order by i_id;

Result:

B_ID         I_ID R_ID
------ ---------- ------------------------------
W00001       1234 B00001|X00001
B00001       1235 T00001|B00001
T00001       1236 B00001
X00001       1237 W00001|T00001|M00001
M00001       1238 M00001

5 rows selected.

Comments

Can you also help me eliminate the duplicates present in R_ID. T00001 1236 B00001|B00001|B00001 to T00001 1236 B00001
SYS_GUID() is used in the CONNECT BY clause to prevent cyclic connections caused by multiple input rows. Using a correlated TABLE collection expression prevents this, as there is only a single input row, so the SYS_GUID() hack is unnecessary.
The use of sys_guid() is indeed a hack, but so is connect by level from dual to generate rows in the first place. Both work (for now - until Oracle will make them not work).
If I execute the above query with the actual data, which has more than 9000 records, it takes a lot of time to execute. Can you suggest anything that can increase the performance?
Is there an index on I_ID? Also: You may want to open a new question (right here on stackoverflow); you may link back to this thread to state the problem and the solutions, which work but take a long time, and ask for help on improving performance. If you do that, make sure to post explain plans (or better yet, execution plans) for what you have tried. I'll give it some thought today too, I will write back if I can come up with anything. But do check about an index on I_ID first, if there isn't one yet that may help.

Decided to add another answer since it uses an entirely different method: recursive subquery factoring, available since Oracle version 11.2.

I did some testing, with an input (persistent) table called test_data with 9000 rows, each r_id being a comma-separated string of 200 tokens; structure very similar to the OP's small sample in the original post. I tried three methods: the hierarchical query (using connect by with the prior sys_guid() trick), which I proposed; the solution based on a correlated subquery and nested table, posted by MT0; and the recursive query I will show below. In each case, I used the query as the select... portion of a CTAS statement.

  • The hierarchical query took 39 minutes to complete.
  • The correlated subquery and nested table method took 18 minutes.
  • The recursive subquery factoring method took 13 minutes.

(To compare apples to apples I modified MT0's query, removing the extra information that the "tokens" in r_id are numbers - I treated them as strings, as do the other two methods.)

Recursive query:

with
     prep ( b_id, i_id, str, n, st_pos, end_pos, token) as (
       select  b_id, i_id, ',' || r_id || ',', -1, null, 1, null
         from  test_data
       union all
       select  b_id, i_id, str, n+1, end_pos + 1, instr(str, ',', 1, n+3),
               substr(str, st_pos, end_pos - st_pos)
         from  prep
         where end_pos != 0
     ),
     z ( b_id, i_id, n, token, rn ) as (
       select b_id, i_id, n, token,
              row_number() over (partition by i_id, token order by n)
       from prep
     )
select z.b_id, z.i_id, 
       listagg(t.b_id, '|') within group (order by z.n) as r_id
from   z join test_data t
              on z.token = t.i_id
where  z.rn = 1
group by z.b_id, z.i_id
;

Actually one can squeeze a little extra performance: in the anchor part of the recursive CTE (the first member of the union all in the definition of prep), I could start from n = 0, st_pos = 1 and end_pos = the position of the first comma (actually the second comma in the altered string; I find it a lot easier to add commas at the beginning and end of the input CSV string and write the recursive CTE as I did). However, this saves just one iteration out of 200 for each string, so only about 0.5% of execution time could be saved. I find the way I wrote the recursive CTE easier to follow.
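The per-string recursion is just a walk from comma to comma over the padded string. A Python trace of that walk (illustration only, keeping Oracle's 1-based positions and INSTR-returns-0 convention):

```python
def split_recursive(r_id):
    # Trace of the recursive CTE: pad with commas, then repeatedly take the
    # text between consecutive commas (the INSTR/SUBSTR walk).
    s = f",{r_id},"
    tokens = []
    st = 2                               # just past the leading comma (1-based)
    while True:
        end = s.find(",", st - 1) + 1    # next comma; 0 when none left (like INSTR)
        if end == 0:
            break
        tokens.append(s[st - 1:end - 1])
        st = end + 1
    return tokens

print(split_recursive("1234,1236,1238"))  # ['1234', '1236', '1238']
```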

For completeness, here is the modified version of the "nested table" method I used (credit @MT0):

select b_id,
       i_id,
       ( select listagg(t.b_id, '|') within group (order by rownum)
         from   table ( 
                        cast ( 
                               multiset (
                                  select distinct regexp_substr(d.r_id, '[^,]+', 1, level) 
                                  from dual
                                  connect by level <= regexp_count(d.r_id, ',') + 1
                               ) 
                               as sys.odcivarchar2list
                        ) 
                ) v
                inner join test_data t
                on (v.column_value = t.i_id)
       )
from   test_data d;

Comments

Wow, this works fine (y). I'm new to SQL. Can you please explain the "nested table" method you have used?
I was hoping to make it very clear that the "nested table" method is not mine; it is the solution MT0 suggested. You may want to ask him directly.
@MT0 Can you please explain the "nested table" method you have used?
