
With a table like:

       uid               |            segmentids
-------------------------+----------------------------------------
 f9b6d54b-c646-4bbb-b0ec | 4454918|4455158|4455638|4455878|4455998
 asd7a0s9-c646-asd7-b0ec | 1265899|1265923|1265935|1266826|1266596
 gd3355ff-cjr8-assa-fke0 | 2237557|2237581|2237593
 laksnfo3-kgi5-fke0-b0ec | 4454918|4455158|4455638|4455878

How to create a new table with:

      uid               |         segmentids
-------------------------+---------------------------
 f9b6d54b-c646-4bbb-b0ec |           4454918
 f9b6d54b-c646-4bbb-b0ec |           4455158
 f9b6d54b-c646-4bbb-b0ec |           4455638
 f9b6d54b-c646-4bbb-b0ec |           4455878
 f9b6d54b-c646-4bbb-b0ec |           4455998
 asd7a0s9-c646-asd7-b0ec |           1265899
 asd7a0s9-c646-asd7-b0ec |           1265923
 asd7a0s9-c646-asd7-b0ec |           1265935
 asd7a0s9-c646-asd7-b0ec |           1266826
 asd7a0s9-c646-asd7-b0ec |           1266596

The number of segments is dynamic and can vary with each record. I tried the Split function with a delimiter, but it requires an index into the string, which is dynamic here.
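
For example, something like split_part with a hard-coded index only ever returns one segment per row (source_table is just a placeholder name for the table above):

-- Only returns the first segment for every uid; a hard-coded index
-- cannot cover rows that contain a different number of segments.
-- "source_table" is a placeholder name for the table shown above.
SELECT uid,
       split_part(segmentids, '|', 1) AS first_segment
FROM source_table;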

Any suggestions?

  • You tagged several DBMS products; how come? Commented Oct 17, 2017 at 7:30
  • What is dynamic here? You are using index = 1 in your example; please show a different sample to explain. Commented Oct 17, 2017 at 7:32
  • I understand the question. Please change your tags to just the one you are using, as the answer is completely different! Commented Oct 17, 2017 at 7:35

3 Answers


Here is the Redshift answer; it will work with up to 10,000 segment id values per row.

test data

create table test_split (uid varchar(50),segmentids varchar(max));
insert into test_split
    values
      ('f9b6d54b-c646-4bbb-b0ec','4454918|4455158|4455638|4455878|4455998'),
      ('asd7a0s9-c646-asd7-b0ec','1265899|1265923|1265935|1266826|1266596'),
      ('asd7345s9-c646-asd7-b0ec','1235935|1263456|1265675696'),
      ('as345a0s9-c646-asd7-b0ec','12765899|12658883|12777935|144466826|1266226|12345')
;

code

with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
  , generated_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
      JOIN ten_numbers AS t2 ON 1 = 1
      JOIN ten_numbers AS t3 ON 1 = 1
      JOIN ten_numbers AS t4 ON 1 = 1
)
  , splitter AS
(
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(segmentids, '\\|') + 1)
                                 FROM test_split)
)
  --select * from splitter;
  , expanded_input AS
(
    SELECT
      uid,
      split_part(segmentids, '|', s.gen_num) AS segment
    FROM test_split AS ts
      JOIN splitter AS s ON 1 = 1
    WHERE split_part(segmentids, '|', s.gen_num) <> ''
)
SELECT * FROM expanded_input;

The first two CTE steps (ten_numbers and generated_numbers) are used to generate 10,000 rows of numbers; this is needed because generate_series is not supported in Redshift.

The next step (splitter) keeps only the numbers from 1 up to the maximum number of delimiters + 1 (which is the maximum number of segments).

Finally, we cross join splitter with the input data, extract the corresponding value using split_part, and then exclude blank parts (which occur where a row has fewer than the maximum number of segments).
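
Since the question asks for a new table rather than just a result set, the same logic can be wrapped in a CREATE TABLE AS statement. A minimal sketch, assuming a hypothetical target table name of uid_segments (the CTE chain is the same as above, slightly condensed):

CREATE TABLE uid_segments AS
WITH ten_numbers AS (
    SELECT 0 AS num UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
    UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
  , generated_numbers AS (
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
      JOIN ten_numbers AS t2 ON 1 = 1
      JOIN ten_numbers AS t3 ON 1 = 1
      JOIN ten_numbers AS t4 ON 1 = 1
)
  , splitter AS (
    SELECT gen_num
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(segmentids, '\\|') + 1)
                                 FROM test_split)
)
-- the flattened rows become the contents of the new table
SELECT ts.uid,
       split_part(ts.segmentids, '|', s.gen_num) AS segmentid
FROM test_split AS ts
  JOIN splitter AS s ON 1 = 1
WHERE split_part(ts.segmentids, '|', s.gen_num) <> '';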


5 Comments

Working like a charm. Still trying to figure out how. I get the indexing part, but how are you matching it with uids as in the original table?
And why are we using values like 1000, 100 and 10 in ten_numbers?
It generates 10,000 rows of output, with gen_num incrementing for each row. Normally you could use the generate_series function for this, but that is not possible in Redshift, so this is a clever workaround. If you don't need 10,000 rows, you could remove that part; there is virtually no overhead in doing this, though.
The matching is done as a cross join (FROM test_split AS ts JOIN splitter AS s ON 1 = 1): every test_split row gets paired with every splitter row, and we only have splitter rows from 1 up to the max number of segments anywhere in the table.
It's beautiful; however, in my time I've come to learn the value of immediately readable code. The proper solution is to include a stage in the pipeline that performs functional transformations in Spark or similar. Then this difficult logic becomes df.withColumn("segment_id", functions.explode(col("seg_array"))), after converting the delimited column to an array via select(split(col("segmentids"), "\\|").alias("seg_array")).

You can iterate over the SUPER array returned by split_to_array -- see the "Unnesting and flattening" section of this post. Using the same test_split table as the previous answer:

WITH seg_array AS
  (SELECT uid,
          split_to_array(segmentids, '|') segs
   FROM test_split)
SELECT uid,
       segmentid::int
FROM seg_array a,
     a.segs AS segmentid;



Redshift now has the SUPER data type and the split_to_array function, which is similar to PostgreSQL's string_to_array.
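
For example (a rough sketch of its behavior), split_to_array takes the string and a delimiter and returns a SUPER array holding the individual parts:

-- Returns a SUPER array such as ["4454918","4455158","4455638"]
SELECT split_to_array('4454918|4455158|4455638', '|') AS segs;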

Redshift now also supports unnesting arrays through syntax similar to a LATERAL JOIN in PostgreSQL.

Using these techniques, we may write the same transformation in 2022 as

WITH split_up AS (
    SELECT
      uid
    , split_to_array(segmentids, '|') segment_array
    FROM test_split
)
SELECT
  su.uid
, CAST(sid AS VARCHAR) segmentid
FROM split_up su
JOIN su.segment_array sid ON TRUE

