
I have a table containing incident ids. Each incident can be linked to another one because they have been stated as similar. The table contains two columns:

  • incident_id the id of the incident
  • linked_incident_id the id of a linked incident

I would like to make groups based on these ids.

I tried to make left joins to link the ids, but if an id is linked to n ids I would need n left joins. Imagine n = 100; this becomes absurd.

Here is a simplified version of my table:

incident_id linked_incident_id
1 5
4 8
2 3
8 2
9 10
5 null
7 9
2 8
9 7
6 10
10 null
12 11
13 null

Here is the expected output:

incident_id linked_incident_id groups
1 5 group 1
4 8 group 2
2 3 group 2
8 2 group 2
9 10 group 3
5 null group 1
7 9 group 3
2 8 group 2
9 7 group 3
6 10 group 3
10 null group 3
12 11 group 4
13 null group 5

The actual data I am working with is bigger (approx. 25000 lines).
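For reference, what is being asked for here is the connected components of an undirected graph. A minimal Python sketch using a disjoint-set (union-find) structure, with the question's sample rows hardcoded, reproduces the expected grouping (the function names are illustrative only):

```python
# Connected components over the sample links, via a disjoint-set
# (union-find) structure with path halving.
def find(parent, x):
    # Walk up to the root, flattening the chain as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # keep the lowest id as root

# (incident_id, linked_incident_id) pairs from the question; None = null.
links = [(1, 5), (4, 8), (2, 3), (8, 2), (9, 10), (5, None), (7, 9),
         (2, 8), (9, 7), (6, 10), (10, None), (12, 11), (13, None)]

nodes = {i for pair in links for i in pair if i is not None}
parent = {i: i for i in nodes}
for a, b in links:
    if b is not None:
        union(parent, a, b)

groups = {}
for i in sorted(nodes):
    groups.setdefault(find(parent, i), []).append(i)
print(groups)  # {1: [1, 5], 2: [2, 3, 4, 8], 6: [6, 7, 9, 10], 11: [11, 12], 13: [13]}
```

These five components map directly onto groups 1 through 5 of the expected output.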

I am actually working with Snowflake and PostgreSQL 17, but any SQL engine would be helpful; I am missing an effective logic here. An SQL query would be ideal!

  • Can you have (or, what do you have to prevent) circular references? For example 1->2 and 2->3 and 3->1. (Hint: in parent-child tree structures [aka adjacency lists], either the parent or the child should have a unique constraint on it.) Commented Jun 8 at 19:26
  • @MatBailie yes I can have those kinds of cases in my table, as shown in the example I gave, but I can find a way to remove them. Is this a mandatory step to achieve what I need? Commented Jun 8 at 20:41

3 Answers


So the problem is that incident1 is linked to incident2, which is linked to incident3, and so on, and you need to make them part of the same group. First of all, this is already possible with your data via recursive queries; here's an untested example:

WITH RECURSIVE t AS (
    SELECT incident_id AS group_id, incident_id FROM incidents WHERE linked_incident_id IS NULL
  UNION ALL
    SELECT t.group_id, incidents.incident_id FROM t JOIN incidents ON t.incident_id = incidents.linked_incident_id
)
SELECT
    group_id,
    string_agg(incident_id::text, ',')
FROM t
GROUP BY group_id;

Tried this with the sample of

create table incidents (incident_id int, linked_incident_id int);

insert into incidents(incident_id, linked_incident_id)
values
(1, null),
(2, null),
(0, null),
(3, 1),
(5, 1),
(7, 3),
(9, 3),
(10, 2);

Fiddle.

So in this recursive query we first define the trivial records, that is, records whose linked incident is null (the root incidents), and use their incident_id as group_id, which will later propagate to their descendants via the second select after the union all, which finds descendants, then descendants of descendants, and so on.

With this recursive logic we will end up having pairs of (group_id, incident_id) which are the pairs of root incidents and the current incident.

With this tool at our disposal we can simply select from this one, grouping by the group_id and concatenating the incident_id.
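As a quick sanity check, an equivalent of this recursive query can be run through Python's sqlite3 module, with SQLite's group_concat standing in for string_agg (the schema mirrors the sample above):

```python
import sqlite3

# In-memory database with the answer's tree-shaped sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE incidents (incident_id INT, linked_incident_id INT);
    INSERT INTO incidents VALUES
        (1, NULL), (2, NULL), (0, NULL),
        (3, 1), (5, 1), (7, 3), (9, 3), (10, 2);
""")

# Same recursive CTE: roots seed group_id, children inherit it.
rows = con.execute("""
    WITH RECURSIVE t AS (
        SELECT incident_id AS group_id, incident_id
        FROM incidents WHERE linked_incident_id IS NULL
        UNION ALL
        SELECT t.group_id, incidents.incident_id
        FROM t JOIN incidents ON t.incident_id = incidents.linked_incident_id
    )
    SELECT group_id, group_concat(incident_id) FROM t GROUP BY group_id
""").fetchall()
print(rows)
```

On this tree-shaped data, root 1 collects {1, 3, 5, 7, 9}, root 2 collects {2, 10}, and root 0 stays alone. Note this only works because the sample is an actual tree (no cycles), which is exactly the caveat discussed below.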

However, this is not an ideal solution, because you may end up with problems when two issues link to each other, and so forth. And of course, doing this kind of recursive search may not be best for performance either.

So, even though this is a solution for your current setup, I propose the following steps:

  • make sure that temporarily new records will not be inserted
  • create a root_incident_id field
  • update it for currently existing records, based on the results of the recursive query described earlier, so each record has a root incident pointing to the root problem's incident_id
  • create a trigger on your table for inserts, so if you create a new record whose linked incident is not null, the root incident id is set to the root of the linked incident
  • create a trigger on your table for updates, so if you update an existing record, the root incident is set to null if the linked incident is null, and to the root of the linked incident otherwise
  • create a trigger on your table for deletes, so if you delete an existing record whose linked incident is null, its immediate children (those whose linked incident is the one you removed) get a null root, and the root incident of their descendants is adjusted accordingly

This way you will have a consistent database, and you will not need recursion whenever you want to group by root incident.
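A minimal sketch of the insert-trigger idea, using sqlite3 for convenience (the root_incident_id column and the trigger name are illustrative; PostgreSQL syntax differs and would use a trigger function):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE incidents (
        incident_id INT PRIMARY KEY,
        linked_incident_id INT,
        root_incident_id INT
    );
    -- On insert, copy the root from the linked incident,
    -- or make the new row its own root when it has no link.
    CREATE TRIGGER incidents_set_root AFTER INSERT ON incidents
    BEGIN
        UPDATE incidents
        SET root_incident_id = COALESCE(
                (SELECT root_incident_id FROM incidents
                 WHERE incident_id = NEW.linked_incident_id),
                NEW.incident_id)
        WHERE incident_id = NEW.incident_id;
    END;
""")
con.executemany(
    "INSERT INTO incidents(incident_id, linked_incident_id) VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, None), (5, 4)])
print(con.execute(
    "SELECT incident_id, root_incident_id FROM incidents").fetchall())
# [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```

This only covers the insert case; the update and delete triggers from the bullet list above would need the descendant-fixing logic as well.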


19 Comments

As pointed out by Lajos, his current algorithm won't work on your data as is, @Leyth, because it first has to be normalized to a tree structure. For example your 2 > 8 and 8 > 2 mutual loop won't work from the start, as the starting point for Lajos' algorithm is that one of them has no linked_incident_id. Depending on your data a recursion protection may be necessary, as well as something to enter self-isolated islands.
@leyth research Adjacency Lists for what you're doing, or Nested Sets as a common alternative.
@Leyth the problem you encountered is that your graph has cycles and therefore it's not a tree. If incident 1 is linked to incident 2 and incident 2 is linked to incident 1, then if you try to traverse these incidents, you will end up with an infinite cycle, as due to incident 1 incident 2 is evaluated, due to incident 2 incident 1 is evaluated, then incident 2, then incident 1 without the cycle ever ending.
@Leyth that's one of the options. The second option is to interpret your data as a tree.
@Leyth in order to interpret your graph as a tree you will need extra information that will somehow remove the cycles. So for the space of this discussion I will assume that we remove the cycles

General handling of loops and isolated islands

The biggest fear that I have in implementing a general "flocculation" algorithm with a recursive CTE is that some islands have to be maintained unchanged for 1 or 2 iterations while other islands are aggregating, finally reaching our previously "stable" island;
this seems to contradict a principle of recursive CTEs, which is "if something doesn't move anymore at this iteration, stop working with it" (to avoid an infinite loop).

So, can we have something that maintains a state "as long as something still moves", while having a reasonable stop condition to avoid looping forever?

Of course we could get around recursive CTE, either by iterating only over one big state row (which we would control every aspect, particularly the "did anything move on last iteration?"), with JSON or such to store our complete state; or use PL/pgSQL.
But using our RDBMS as either a state machine or a procedural program seems a bit of a waste while we have a language dedicated to working on sets…

First attempt: brute-force

We can recursively walk through all relations from all nodes, with an array storing already encountered nodes to avoid loops.

Thus if 1-2-3-4 are linked, 1 will visit 2 while 2 visits 3 and 3 visits 4; but then 1 will visit 3 (through its newly aggregated 2) while 2 visits 4, and then 1 will visit 4 (through its new 3): when reaching 2 from 1, we continue visiting one by one, while we could have reused the already aggregated result of the walkthrough from 2.

So this is surely not the most efficient, but it is robust and can serve as a functional benchmark to compare the results of other implementations to.
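The frontier-and-seen bookkeeping of this brute-force walk can be mirrored in plain Python, which may help when reading the SQL below (variable names chosen to echo the seing/seen arrays; the data is the question's sample):

```python
# Brute-force walk mirroring the recursive CTE: every node expands a
# frontier ("seing") while accumulating everything visited ("seen"),
# and keeps the lowest id it ever reached as its group id.
links = [(1, 5), (4, 8), (2, 3), (8, 2), (9, 10), (5, None), (7, 9),
         (2, 8), (9, 7), (6, 10), (10, None), (12, 11), (13, None)]

# Undirected adjacency: a link can be followed in both directions.
neighbors = {}
for a, b in links:
    neighbors.setdefault(a, set())
    if b is not None:
        neighbors.setdefault(b, set())
        neighbors[a].add(b)
        neighbors[b].add(a)

def group_id(start):
    seen, frontier, gid = {start}, {start}, start
    while frontier:
        # Expand one step from the frontier, discarding known nodes.
        frontier = {n for f in frontier for n in neighbors[f]} - seen
        seen |= frontier
        gid = min([gid, *frontier])
    return gid

print({i: group_id(i) for i in sorted(neighbors)})
```

As in the SQL, every node redoes the full walk of its own component, which is what makes this version robust but quadratic per component.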

with recursive
    -- Have each node represented (even node 3 of your example which exists only as a linked_incident_id, without an incident_id).
    n as (select incident_id id from incident union select linked_incident_id from incident where linked_incident_id is not null),
    -- Brute-recurse
    -- Each iteration will associate each id with:
    -- - its lowest encountered linked incident (gid)
    -- - the linked incidents it was receiving on last iteration (seing)
    -- - every incident it already encountered (seen)
    -- Iteration n + 1 simply unnests(seing), and for each of those looks in incident what its linked incidents are (or what incidents it is linked to);
    -- those already in seen are discarded,
    -- the remaining new ones build up the new seing array.
    r as
    (
        select id, id gid, array[id] seing, array[id] seen from n
        union all
        select
            id,
            min(case when gid < lid then gid else lid end),
            array_agg(lid),
            seen||array_agg(lid)
        from
        (
            with prev as (select id, gid, unnest(seing) lid, seen from r)
            select id, gid, l.linked_incident_id lid, seen from prev join incident l on prev.lid = l.incident_id and not l.linked_incident_id = any(seen)
            union
            select id, gid, l.incident_id lid, seen from prev join incident l on prev.lid = l.linked_incident_id and not l.incident_id = any(seen)
        )
        new
        group by id, seen
    ),
    -- Now for each incident_id which lowest gid did it ever see?
    g as
    (
        select id, min(gid) gid from r group by id
    )
-- Regroup back by group:
select gid, array_agg(id) ids from g group by gid;

Which returns:

gid ids
1 {1,5}
2 {8,2,4,3}
6 {6,9,10,7}
11 {12,11}
13 {13}

(see it as the first running query in this fiddle)

Second attempt: contagion

Now that we have a baseline to check our results against, we can try a more subtle flocculation:
start with groups of 1 or 2 incidents (1, to be sure isolated incidents get their group; 2, when we have at least a link),
then recursively merge groups as soon as they have 1 element in common (PostgreSQL && operator on arrays).
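The merge-on-overlap idea can be sketched in plain Python before reading the SQL: keep merging any two groups that share an element until nothing moves (a deliberately naive loop, for illustration only):

```python
# Flocculation in plain Python: start from 1- or 2-element groups,
# then repeatedly merge any two that share an element.
links = [(1, 5), (4, 8), (2, 3), (8, 2), (9, 10), (5, None), (7, 9),
         (2, 8), (9, 7), (6, 10), (10, None), (12, 11), (13, None)]

groups = [{a} if b is None else {a, b} for a, b in links]
merged = True
while merged:
    merged = False
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            if groups[i] & groups[j]:       # one element in common
                groups[i] |= groups.pop(j)  # merge j into i, restart scan
                merged = True
                break
        if merged:
            break

# Name each group after its lowest member, as the SQL version does.
print(sorted((min(g), sorted(g)) for g in groups))
# [(1, [1, 5]), (2, [2, 3, 4, 8]), (6, [6, 7, 9, 10]), (11, [11, 12]), (13, [13])]
```

The SQL below does the same thing set-wise per iteration (all overlapping pairs at once) rather than one pair at a time.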

with recursive
    -- Have each node represented (even those without any link, so that isolated nodes get at least their own group).
    n as (select incident_id id from incident union select linked_incident_id from incident where linked_incident_id is not null),
    -- Starting links. Always have the lowest ID be the group ID.
    -- We strictly compare incident_id to linked_incident_id:
    -- - thus nulls are removed
    -- - and we avoid the = case, which would be redundant with those from n (first select of the union)
    -- But still use union (instead of union all) for pure duplicates, or reverses.
    l as
    (
        select id gid, id from n
        union
        select incident_id, linked_incident_id from incident where incident_id < linked_incident_id
        union
        select linked_incident_id, incident_id from incident where incident_id > linked_incident_id
    ),
    r as
    (
        select gid, array_agg(id ) ids  from l group by gid
        union all
        select gid, ids  from
        (
            with
                -- We can only select once from r, but we need it twice. So put it in a CTE.
                rr as (select * from r),
                -- Any groups that have one element in common get merged.
                encounter as
                (
                    select g1.gid, g1.ids ids1, g2.ids ids2 
                    from rr g1 join rr g2 on g1.ids && g2.ids and g2.gid > g1.gid
                    and not g1.ids @> g2.ids -- To quickly converge, no need to make g1 reappear on next iteration if it already contained everything from the to-be-merged array.
                ),
                merged as
                (
                    select gid, unnest(ids1) id  from encounter
                    union
                    select gid, unnest(ids2) id  from encounter
                )
            select gid, array_agg(id ) ids  from merged group by gid 
        ) r2
    )
-- Now as the iterations returned every intermediate row,
-- keep only the groups whose array_length is the maximum known length for their head-of-group's ID.
-- This removes:
-- - merged groups (their head of group now appears in a group with at least 1 more element than their own group)
-- - previous iterations for the remaining groups
select * from r where (gid, array_length(ids, 1)) in
(
    select unnest(ids) id, max(array_length(ids, 1)) from r group by 1
)
order by gid
;

With a result similar to the first one:

gid ids
1 {1,5}
2 {2,3,4,8}
6 {6,7,9,10}
11 {11,12}
13 {13}

Or with each group as a side column of the original incident table (conforming to the expected result of the question):

incident_id linked_incident_id gid
1 5 1
4 8 2
2 3 2
8 2 2
9 10 6
5 null 1
7 9 6
2 8 2
9 7 6
6 10 6
10 null 6
12 11 11
13 null 13

(second query running in this fiddle)

N.B.: on my own test data, this subtle version ran in 5 iterations instead of 14 for the brute force one.

Third iteration: no array

As pointed out by @MatBailie, and as tested on your real-world data and my random data, a recursive CTE using PostgreSQL arrays as intermediate storage for groups won't handle more than 1500 links in less than 1 min (and beyond that it grows exponentially).

We can improve it a bit:

  • avoid the array_agg() / unnest() between each iterations by relying on a simple flat recursive CTE of id bigint, gid bigint (each ID gets stored with the lowest group ID it could be attached to on last iteration)
  • do some "manual subiterations" on groups only during each iteration over the full dataset.
    If for example incident 100 appears in both groups 2 and 3, and incident 101 in groups 3 and 4, the current iteration will merge group 3 to 2 (add incidents of group 3 to group 2) thanks to 100, and group 4 to 3 thanks to 101. But then next iteration will merge the new 3 (= 3||4) to the new 2 (= 2||3), due to the 3 still in common.
    So we can directly merge groups 2, 3, and 4 without having to wait for the fully merged dataset of next iteration.
    Of course this in itself is recursive (we could have an infinite chain of pair-by-pair hopping), so we'll just hardcode a fixed number of left joins before passing on to the next conventional iteration.
    (and yes, here I'm bordering on procedural programming…)
    We've got two big benefits from those subiterations:
    1. we're not working on the full groups (as already partially aggregated from previous iterations), but only look for an opportunistic merge between the heads of each group. Thus we have a much smaller set to work with
      (and the more iterations we pass, the more we aggregate, the more the groups become big but the fewer they are: so our subiterations become more and more effective)
    2. we're not using recursive CTEs, thus we do not store the intermediate results:
      the more they merge, the more they alleviate memory pressure for the whole recursive CTE
      not really due to smaller groups (if we merge two groups of 10 tickets we get a group of OK maybe 17 tickets, because they had 3 tickets in common, the ones that helped identify that they were connected),
      but because by making 3 passes in 1 we skip having to store 2 intermediate recursive resultsets (that we want to trash anyway, because we're only interested in the final, biggest merge)
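The effect of these subiterations can be illustrated with a plain dict standing in for the gid -> newgid merge map: composing the map with itself once per subiteration collapses chains of merges, which is essentially pointer jumping (the chain here extends the 2/3/4 example above; the numbers are illustrative):

```python
# A merge map built from one scan: merge 3 into 2, 4 into 3,
# 5 into 4, 6 into 5 (a chain, the worst case for convergence).
merge = {3: 2, 4: 3, 5: 4, 6: 5}

def subiterate(m):
    # One left-join-like pass, as in encounter1/encounter2:
    # follow the target's own remap when it exists.
    return {gid: m.get(new, new) for gid, new in m.items()}

m1 = subiterate(merge)  # {3: 2, 4: 2, 5: 3, 6: 4}
m2 = subiterate(m1)     # {3: 2, 4: 2, 5: 2, 6: 2}
print(merge, m1, m2)
```

Each pass roughly doubles how far a chain collapses, so two hardcoded subiterations resolve chains of four merges in a single outer iteration instead of three.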

Performance improves (on a test with n random links between fictive nodes going from 1 to n, with a big cluster expected on node 1 due to manual links around it in addition to the random ones):

n v2 v3.0x0 v3.0x1 v3.0x2 v3.1x0 v3.1x1 v3.1x2 v3.1x3 v3.1x4 v3.1x5
1000 28 4 s
1500 2 mn 12 s
2000 6 mn 4 s 6 s 16 s
4000 15 s 45 s 2 mn 30 15 s 18 s 9 s 5 s 2.5 s 2.5 s
6000 40 s 50 s 26 s 15 s 8 s 7 s
8000 1 mn 20 2 mn 52 s 30 s 17 s 13 s

Legend:

  • v2 is the "Second attempt" above
  • v3 is this "no array" algorithm
  • the xn at the end gives the number of subiterations per iteration, thus the n of the last encountern CTE
    (so for example the query pasted below is the v3.1x2 because it ends with an encounter2)
  • v3.0 is a bugged version where the encountern CTEs lacked the distinct,
    so the only distinct was on the join with rr, which is the full dataset, while we wanted to be quick by deduplicating while working only with the heads of groups (the contents of the encountern CTEs :-\ )
  • v3.1 fixes it. Note that v3.0x0 has the same time as v3.1x0 (15 s), because x0 means "0 subiteration". And since the bug was on the subiterations…
  • The tests were run with shell for loops and a bit of SQL preprocessor to run x0, x1, x2, x3, x4, x5, x0, x1… (3 times)
    It looks like the more work is done in subiterations, the quicker we go.
    But those tests are not representative of reality, and are running over fully random links between random tickets
    It is possible that real data with "natural" clusters of tickets don't require as many subiterations to work optimally.

v3.1x6 was the best performing on tests with 25000 links, taking 4 to 8 min (with random data making one big cluster sometimes reaching 20000 incidents, the following clusters never being more than 12 incidents long).

-- Version 3, without arrays.
with recursive
    -- Have each node represented (even those without any link, so that isolated nodes get at least their own group).
    n as (select incident_id id from incident union select linked_incident_id from incident where linked_incident_id is not null),
    -- Starting links. Always have the lowest ID be the group ID.
    -- We strictly compare incident_id to linked_incident_id:
    -- - thus nulls are removed
    -- - and we avoid the = case, which would be redundant with those from n
    -- But still use union (instead of union all) for pure duplicates, or reverses.
    l as
    (
        select incident_id gid, linked_incident_id id from incident where incident_id < linked_incident_id
        union
        select linked_incident_id gid, incident_id id from incident where incident_id > linked_incident_id
    ),
    r as
    (
        select gid, id  from l
        union all
        select gid, gid  from l -- Make the gid (ID of the group's head) itself appear in the group.
        union all
        select gid, id  from
        (
            with
                -- We can only select once from r, but we need to self-join or group by. So put it in a CTE.
                rr as (select * from r),
                -- Each ID that appears in two groups merges those two groups.
                -- Thus if ID 8 appears in both groups 2 and 3, we know that 3 should be merged to 2.
                encounter as
                (
                    select distinct min(gid) over (partition by id) newgid, gid
                    from rr
                    where id in (select id from rr group by id having count(1) > 1)
                ),
                -- Now that our window function has run, we discard the "merge the lowest ID group to itself" row.
                encounter0 as (select * from encounter where newgid < gid),
                -- Chains will slow down our iteration:
                -- if 100 is common to 2 and 3, 101 to 3 and 4, 102 to 4 and 5, we'll want to merge 3 to 2, 4 to 3, 5 to 4.
                -- But then on next iteration we'll find 101 in both the new 2 (= 2||3) and the new 3 (= 3||4), so we'll want to merge them too.
                -- But then on next iteration we'll find 102 etc.
                -- So we could want to merge as many groups as possible in one iteration, to avoid merging them on this iteration AND on the next.
                -- This is less costly here (where we only handle group heads) than in the recursive merge (where we union ALL nodes of both groups).
                -- But if we have an infinite-length chain, this would require recursivity in our recursivity, which complexifies a lot of things (and depending on the RDBMS may even not be accepted).
                -- However we can mitigate it by introducing some "manual iteration" steps here, for at least one batch of encounters where the newgid itself appears as a merged to a group of lower ID.
                encounter1 as
                (
                    select distinct coalesce(e1.newgid, e0.newgid) newgid, e0.gid
                    from encounter0 e0 left join encounter0 e1 on e0.newgid = e1.gid
                ),
                encounter2 as
                (
                    select distinct coalesce(e1.newgid, e0.newgid) newgid, e0.gid
                    from encounter1 e0 left join encounter1 e1 on e0.newgid = e1.gid
                )
                -- Now merge.
            select distinct newgid gid, id  from encounter2 e join rr on rr.gid in (e.gid, e.newgid)
        ) r2
    ),
    -- Each node now gets the lowest gid it saw during iterations as its definitive gid.
    best as
    (
        select min(gid) gid, id from r group by id
    ),
    -- Complete with isolated nodes.
    everything as
    (
        select * from best
        union all
        select id, id from n where id not in (select id from best)
    )
-- By group display:
--select gid, array_agg(id) ids from everything group by 1 order by 1;
-- By incident display:
select i.*, gid from incident i join everything g on incident_id = g.id;

And of course we get the same results as the two previous queries.

(see it as the third query in the fiddle)

4 Comments

Scaling to the OP's 25k rows is going to be brutal.
Ouch @MatBailie you're right. On my laptop 1000 links between 1000 incidents (insert into incident select random(1, 1000), random(1, 1000) from generate_series(1, 1000);) already take 40 s / 25 s (brute / fast method), but with 2000 links I'm already at 2 min (with one big resulting blob of all incidents: the big blob may be the problem). I probably should complete the isolated nodes only after the recursion, avoid the last tedious unnest, and converge as quickly as possible (when 3 is linked to both 1 and 2, merge 2 and 3 to 1 instead of 3 to 2 and to 1 then 2 to 1).
@GuillaumeOutters, first of all let me thank you again for your help! I tried to run the code on Snowflake; unfortunately I had some errors (Invalid query block: seen, which seems linked to the any function at line 24 of the brute-force method, and Syntax error: unexpected 'on' at line 30 of your fast method). I exported the data to test locally on my laptop with PostgreSQL; I first tried with a small amount of data (~1500 rows) and it took 1.5 min / 1.2 min. After this I tried on the whole data; I let it run for 30 minutes and then I stopped.
@Leyth version 3 of my answer could handle 25000 random links (by manual replication of encounterx blocks until having an encounter6 to use after the "Now merge" comment); I suggest that you give it a try, starting with 1000 rows. Sadly it probably won't work in Snowflake, because even if there is no more array (with seen|| or array[id] seen being the probable syntax culprits of your error with my version 1), the doubly recursive structure is unconventional (and unsupported?). However my v3 is more and more sequential: maybe a fully procedural way would please Snowflake?

The query below will give your desired results; please try it and let me know if you face any challenge.
Tested on PostgreSQL.

WITH RECURSIVE incident_links AS
(
    -- Normalize the links to make the graph undirected
    SELECT incident_id, linked_incident_id
    FROM incident_details
    WHERE linked_incident_id IS NOT NULL
    
    UNION
    
    SELECT linked_incident_id AS incident_id, incident_id AS linked_incident_id
    FROM incident_details
    WHERE linked_incident_id IS NOT NULL
),
-- Recursive traversal to find connected components
grouped AS 
(
    SELECT incident_id, incident_id AS root
    FROM incident_links
    
    UNION
    
    SELECT il.linked_incident_id, g.root
    FROM grouped g
    JOIN incident_links il ON il.incident_id = g.incident_id
),
-- Assign the smallest root as the group identifier
group_roots AS 
(
    SELECT incident_id, MIN(root) AS group_root
    FROM grouped
    GROUP BY incident_id
),
-- Include unlinked incidents
all_incidents AS 
(
    SELECT DISTINCT incident_id FROM incident_details
    UNION
    SELECT DISTINCT linked_incident_id FROM incident_details WHERE linked_incident_id IS NOT NULL
),
group_assignments AS 
(
    SELECT 
        ai.incident_id,
        COALESCE(gr.group_root, ai.incident_id) AS final_root
    FROM all_incidents ai
    LEFT JOIN group_roots gr ON ai.incident_id = gr.incident_id
),
-- Assign group names
ranked_groups AS
(
    SELECT 
        incident_id,
    'group ' || DENSE_RANK() OVER (ORDER BY final_root) AS group_name
    FROM group_assignments
)
-- Join back to original table
SELECT 
    t.incident_id,
    t.linked_incident_id,
    rg.group_name
FROM 
    incident_details t
LEFT JOIN ranked_groups rg ON t.incident_id = rg.incident_id;

Input as provided (incident_details table):

select * from incident_details;
incident_id linked_incident_id
1 5
4 8
2 3
8 2
9 10
5 null
7 9
2 8
9 7
6 10
10 null
12 11
13 null

Output received from the above query

incident_id linked_incident_id group_name
1 5 group 1
4 8 group 2
2 3 group 2
8 2 group 2
9 10 group 3
5 null group 1
7 9 group 3
2 8 group 2
9 7 group 3
6 10 group 3
10 null group 3
12 11 group 4
13 null group 5
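This answer's query can be exercised end-to-end with Python's sqlite3 module (lightly condensed: group_assignments is folded into the ranking CTE; SQLite 3.25+ is assumed for DENSE_RANK):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE incident_details (incident_id INT, linked_incident_id INT)")
con.executemany(
    "INSERT INTO incident_details VALUES (?, ?)",
    [(1, 5), (4, 8), (2, 3), (8, 2), (9, 10), (5, None), (7, 9),
     (2, 8), (9, 7), (6, 10), (10, None), (12, 11), (13, None)])

rows = con.execute("""
WITH RECURSIVE incident_links AS (
    -- Normalize the links to make the graph undirected.
    SELECT incident_id, linked_incident_id FROM incident_details
    WHERE linked_incident_id IS NOT NULL
    UNION
    SELECT linked_incident_id, incident_id FROM incident_details
    WHERE linked_incident_id IS NOT NULL
),
grouped AS (
    -- Recursive traversal; UNION (not UNION ALL) stops the cycles.
    SELECT incident_id, incident_id AS root FROM incident_links
    UNION
    SELECT il.linked_incident_id, g.root
    FROM grouped g JOIN incident_links il ON il.incident_id = g.incident_id
),
group_roots AS (
    SELECT incident_id, MIN(root) AS group_root FROM grouped GROUP BY incident_id
),
all_incidents AS (
    -- Include unlinked incidents such as 13.
    SELECT DISTINCT incident_id FROM incident_details
    UNION
    SELECT DISTINCT linked_incident_id FROM incident_details
    WHERE linked_incident_id IS NOT NULL
),
ranked AS (
    SELECT ai.incident_id,
           'group ' || DENSE_RANK() OVER (
               ORDER BY COALESCE(gr.group_root, ai.incident_id)) AS group_name
    FROM all_incidents ai
    LEFT JOIN group_roots gr ON ai.incident_id = gr.incident_id
)
SELECT t.incident_id, t.linked_incident_id, r.group_name
FROM incident_details t
LEFT JOIN ranked r ON t.incident_id = r.incident_id
""").fetchall()
for row in rows:
    print(row)
```

This reproduces the group 1 through group 5 assignments shown above, including the cyclic 2/8 pair, which is why this answer handles the question's data where the tree-based first answer cannot.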

7 Comments

Could you transform your images to Stack Overflow's markdown tables? You just have to copy-paste psql's output to the editing zone, transforming each + to a | (and not forgetting to add an empty line between the preceding text paragraph and the start of your table).
Thanks, I edited as suggested. This was also a new learning for me.
I'll update the Snowflake version in some time; this one I tested in PostgreSQL and it worked as expected.
Algorithmically, this solution has the same complexity as my first "brute force" one (full exploration of every relation); but technically it is way lighter (relying on PostgreSQL UNION to deduplicate and stop the loops, instead of my storing and lookups in arrays). And indeed, for my (very subjective) test of 2000 random links on my laptop, it takes 3.5 s where my v1 takes minutes (and my v3.1x4 milliseconds).
But on the other hand, it has a high memory cost, due to the recursive CTE which never discards its intermediate entries, and to the fact that the walkthrough is done the same for every incident, instead of having one big pool regrouping each group between iterations (thus if 1, 2, 3 are in the same group, we have 1: 1, 2, 3; 2: 1, 2, 3; 3: 1, 2, 3 in memory). Thus going further in my random tests, at 4000 links it still was usable (45 s) but with the memory, not the processor, being the bottleneck; and at 8000 links it finally got the PostgreSQL process killed.
|
