2

I'm an SQL newbie, so there might be a standard recipe for the question I'm about to ask or it might already have been answered. I don't know the proper nomenclature for describing the type of operation I'm after, so I didn't know what to search for.

So, back to my problem: I have a table with transformations of an ID column that is used in other tables. It does include consecutive transformations, possibly several chained such transformations. All rows have a date stamp. Here's a toy sample:

|---------|-------|------------|
| from_id | to_id |       date |
|---------|-------|------------|
|    1001 |  2001 | 2019-01-01 |
|    1002 |  2002 | 2019-01-01 |
|    1003 |  2003 | 2019-02-02 |
|    2001 |  3001 | 2019-03-03 |
|    2002 |  3002 | 2019-03-03 |
|    3001 |  4001 | 2019-04-04 |
|---------|-------|------------|

From this data I would like to create two tables:

  1. A table linking every from_id to its last to_id. For my toy example i want the following:
|---------|-------------|
| from_id | final_to_id |
|---------|-------------|
|    1001 |        4001 |
|    1002 |        3002 |
|    1003 |        2003 |
|    2001 |        4001 |
|    2002 |        3002 |
|    3001 |        4001 |
|---------|-------------|
  1. I also want a table with all linked ID:s, in both directions. For my toy example:
|------|------|
| id_1 | id_2 |
|------|------|
| 1001 | 2001 |
| 1001 | 3001 |
| 1001 | 4001 |
| 1002 | 2002 |
| 1002 | 3002 |
| 1003 | 2003 |
| 2001 | 1001 |
| 2001 | 3001 |
| 2001 | 4001 |
| 2002 | 1002 |
| 2002 | 3002 |
| 2003 | 1003 |
| 3001 | 1001 |
| 3001 | 2001 |
| 3001 | 4001 |
| 3002 | 1002 |
| 3002 | 2002 |
| 4001 | 1001 |
| 4001 | 2001 |
| 4001 | 3001 |
|------|------|

These two results could of course be combined into just one table, where the relevant rows for result 1 is just highlighted by a flag.

So, how do I do this in SQL? Any help deeply appreciated. Note that I don't know how many steps the longest chain of transformations contains, but I do know that it's increasing from day to day.

8
  • 1
    I don't understand the logic behind the relationships in your two expected output tables. Commented Oct 11, 2019 at 6:47
  • 1
    Check if your dbms supports recursive cte. Commented Oct 11, 2019 at 6:55
  • 1
    You example data suggest that the last digit determines a group it belongs to. Is that also the case for your real data? Commented Oct 11, 2019 at 7:18
  • @a_horse_with_no_name I'm using Snowflake, which has its own dialect based on SQL-92 ANSI. I've added a Snowflake tag. Commented Oct 11, 2019 at 8:06
  • @Serg Yes, it does. Commented Oct 11, 2019 at 8:07

2 Answers 2

2

As Gordon said, you need a recursive CTE, which embodies a particular kind of fixed-point recursion. I think it's easiest to think about this in two phases:

  1. Complete the transitive closure; then
  2. Use this to produce the result you want.

We'll use the CTE for the first part. The CTE has two clauses, separated by a "union all". The first clause is run once to prime them pump; the second one runs repeatedly until it produces no output (or exceeds the allowable number of iterations). Each time the second clause runs, two things happen:

  • The results are appended to the CTE results; and
  • The results replace the "working" value of the CTE within the CTE.

With that in mind, here's a CTE that computes the transitive closure query. It makes a few important assumptions:

  • There are no cycles in your ID chains;
  • The IDs are always increasing; and
  • The dates are irrelevant.

Here's the code:

with cte as (
  select from_id, to_id
  from t
  union all
  select t1.from_id, t2.to_id
  from cte t1 join t t2 on t1.to_id = t2.from_id
)
select * from cte;

The first clause will generate your original table (minus the date column):

Round 1:
FROM_ID    TO_ID   
-------  -------  
   1001     2001  
   1002     2002  
   1003     2003  
   2001     3001  
   2002     3002  
   3001     4001  

Then the second clause will use this result as the working table and join it with your original table. This will produce the next round:

Round 2:
FROM_ID    TO_ID   
-------  ------- 
   1001     3001  
   1002     3002  
   2001     4001  

This will be appended to the result, but becomes the working table for the next round. So our third round gives us:

Round 3:
FROM_ID    TO_ID   
-------  ------- 
   1001     4001  

The next round gives no results, which means no round will ever produce any new results—the CTE has reached a fixed-point. This is when the CTE terminates and gives us its final result:

FROM_ID    TO_ID   
-------  -------  
   1001     2001  
   1002     2002  
   1003     2003  
   2001     3001  
   2002     3002  
   3001     4001  
   1001     3001  
   1002     3002  
   2001     4001  
   1001     4001 

We still need to do step 2, but from here it's relatively easy: your result is just the set of rows in the CTE with the maximal TO_ID for each FROM_ID. We'll add a little post-processing to the CTE:

with cte as (
  select from_id, to_id
  from t
  union all
  select t1.from_id, t2.to_id
  from cte t1 join t t2 on t1.to_id = t2.from_id
)
select from_id, max(to_id) as to_id
from cte
group by from_id; 

Which gives us:

FROM_ID    TO_ID   
-------  ------- 
   1001     4001
   1002     3002
   1003     2003
   2001     4001
   2002     3002
   3001     4001

There we go. As Gordon pointed out, the other question should also be straightforward using the results of the same CTE:

with cte as (
  select from_id, to_id
  from t
  union all
  select t1.from_id, t2.to_id
  from cte t1 join t t2 on t1.to_id = t2.from_id
)
select from_id id_1, to_id id_1
from cte
union all
select to_id id_1, from_id id_1
from cte;
Sign up to request clarification or add additional context in comments.

1 Comment

@Isasac Thanks for your thorough reply. I never used recursive CTE:s before so this was great for me. I ended up using your code with the extension of including a monotonically increasing counter column to account for more general ID transformations that do not necessarily increase in value.
2

You need a recursive CTE for this purpose. I think the simplest method is to go backwards from the most recent date:

with cte as (
      select t.to_id, t.from_id, t.from_id as terminal_id
      from toy t
      where not exists (select 1 from toy t2 where t2.from_id = t.to_id)
      union all
      select t.to_id, t.from_id, cte.terminal_id
      from cte join
           toy t
           on t.to_id = cte.from_id
     )
select *
from cte;

This produces half the rows. For the other half, you can do:

select to_id, from_id, terminal_id
from cte
union all
select from_id, to_id, terminal_id
from cte;

1 Comment

Thanks for your reply. Spot on. I ended up accepting the answer by @Isaac because it was more informative for a CTE newbie as myself.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.