General handling of loops and isolated islands
The biggest fear I have in implementing a general "flocculation" algorithm with a recursive CTE is that some islands have to stay unchanged for 1 or 2 iterations while other islands keep aggregating, until one of them finally reaches our previously "stable" island;
this seems to contradict a principle of recursive CTEs which is "if something doesn't move anymore at this iteration, stop working with it" (to avoid an infinite loop).
So, can we have something that maintains a state "as long as something still moves", while having a reasonable stop condition to avoid looping forever?
Of course we could get around recursive CTEs, either by iterating over a single big state row (whose every aspect we would control, particularly the "did anything move on the last iteration?" flag), with JSON or the like storing our complete state; or by using PL/pgSQL.
But using our RDBMS as either a state machine or a procedural program seems a bit of a waste while we have a language dedicated to working on sets…
First attempt: brute-force
We can recursively walk through all relations from all nodes, with an array storing already encountered nodes to avoid loops.
Thus if 1-2-3-4 are linked, 1 will visit 2 while 2 visits 3 and 3 visits 4; then 1 will visit 3 (through its newly aggregated 2) while 2 visits 4; and then 1 will visit 4 (through its new 3). So when reaching 2 from 1, we keep visiting nodes one by one, when we could have reused the already aggregated result of the walkthrough starting from 2.
So this is surely not the most efficient, but it is robust and can serve as a functional benchmark to compare the results of other implementations to.
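To make the walk concrete before diving into the SQL, here is a minimal Python sketch of the same idea (the function name is made up; the links mirror the incident table used throughout this answer): every node walks the graph in parallel, keeping a frontier of newly reached nodes (seing), a set of everything already visited (seen), and the lowest ID encountered so far.

```python
# Brute-force flocculation: every node walks the whole graph in parallel.
# seing = the frontier reached at the previous step, seen = everything
# already visited; each node remembers the lowest ID it encountered.
def flocculate_brute(links, nodes):
    # Undirected adjacency from (incident_id, linked_incident_id) pairs.
    adj = {n: set() for n in nodes}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    gid = {}
    for start in nodes:
        best, seen, seing = start, {start}, {start}
        while seing:  # one turn of this loop = one recursive CTE iteration
            new = set()
            for v in seing:
                new |= adj[v] - seen  # those already in seen are discarded
            seen |= new
            seing = new
            if new:
                best = min(best, min(new))
        gid[start] = best
    return gid

# Same links as the sample incident table of this answer (nulls omitted).
links = [(1, 5), (4, 8), (2, 3), (8, 2), (9, 10), (7, 9), (2, 8), (9, 7), (6, 10), (12, 11)]
groups = flocculate_brute(links, range(1, 14))
```

Note the same inefficiency as the SQL: each start node redoes the full walk instead of reusing its neighbours' walks.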
with recursive
-- Have each node represented (even node 3 of your example which exists only as a linked_incident_id, without an incident_id).
n as (select incident_id id from incident union select linked_incident_id from incident where linked_incident_id is not null),
-- Brute-recurse
-- Each iteration will associate each id with:
-- - its lowest encountered linked incident (gid)
-- - the linked incidents it was receiving on last iteration (seing)
-- - every incident it already encountered (seen)
-- Iteration n + 1 simply unnest()s seing and, for each of those, looks in incident for its linked incidents (or the incidents it is linked to);
-- those already in seen are discarded,
-- the remaining new ones build up the new seing array.
r as
(
select id, id gid, array[id] seing, array[id] seen from n
union all
select
id,
min(case when gid < lid then gid else lid end),
array_agg(lid),
seen||array_agg(lid)
from
(
with prev as (select id, gid, unnest(seing) lid, seen from r)
select id, gid, l.linked_incident_id lid, seen from prev join incident l on prev.lid = l.incident_id and not l.linked_incident_id = any(seen)
union
select id, gid, l.incident_id lid, seen from prev join incident l on prev.lid = l.linked_incident_id and not l.incident_id = any(seen)
)
new
group by id, seen
),
-- Now for each incident_id which lowest gid did it ever see?
g as
(
select id, min(gid) gid from r group by id
)
-- Regroup back by group:
select gid, array_agg(id) ids from g group by gid;
Which returns:
| gid | ids        |
|-----|------------|
| 1   | {1,5}      |
| 2   | {8,2,4,3}  |
| 6   | {6,9,10,7} |
| 11  | {12,11}    |
| 13  | {13}       |
(see it as the first running query in this fiddle)
Second attempt: contagion
Now that we have the assurance of a reference result, we can try a more subtle flocculation:
start with groups of 1 or 2 incidents (1, to be sure isolated incidents get their own group; 2, when we have at least one link),
then recursively merge groups as soon as they have 1 element in common (PostgreSQL && operator on arrays).
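In procedural terms, this contagion is just repeated merging of overlapping sets. A small Python sketch of what the recursion computes (the function name is hypothetical; the data is the same sample incident table):

```python
def flocculate_contagion(links, nodes):
    # Start with one group per node; each link extends the group headed
    # by its lower ID (the group ID is always the lowest member).
    groups = {n: {n} for n in nodes}
    for a, b in links:
        groups[min(a, b)].add(max(a, b))
    # Merge any two groups sharing at least one element (the SQL performs
    # this test with the && array-overlap operator), until stable.
    changed = True
    while changed:
        changed = False
        for g1 in sorted(groups):
            for g2 in sorted(groups):
                if g1 < g2 and g1 in groups and g2 in groups and groups[g1] & groups[g2]:
                    groups[g1] |= groups.pop(g2)  # lower head absorbs higher
                    changed = True
    return groups

links = [(1, 5), (4, 8), (2, 3), (8, 2), (9, 10), (7, 9), (2, 8), (9, 7), (6, 10), (12, 11)]
groups = flocculate_contagion(links, range(1, 14))
```

Because whole groups merge at each pass instead of single nodes, this converges in far fewer iterations than the brute-force walk.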
with recursive
-- Have each node represented (even those without any link, so that isolated nodes get at least their own group).
n as (select incident_id id from incident union select linked_incident_id from incident where linked_incident_id is not null),
-- Starting links. Always have the lowest ID be the group ID.
-- We strictly compare incident_id to linked_incident_id:
-- - thus nulls are removed
-- - and we avoid the = case, which would be redundant with those from n (first select of the union)
-- But still use union (instead of union all), to remove pure duplicates or reversed pairs.
l as
(
select id gid, id from n
union
select incident_id, linked_incident_id from incident where incident_id < linked_incident_id
union
select linked_incident_id, incident_id from incident where incident_id > linked_incident_id
),
r as
(
select gid, array_agg(id) ids from l group by gid
union all
select gid, ids from
(
with
-- We can only select once from r, but we need it twice. So put it in a CTE.
rr as (select * from r),
-- Any groups that have one element in common get merged.
encounter as
(
select g1.gid, g1.ids ids1, g2.ids ids2
from rr g1 join rr g2 on g1.ids && g2.ids and g2.gid > g1.gid
and not g1.ids @> g2.ids -- To quickly converge, no need to make g1 reappear on next iteration if it already contained everything from the to-be-merged array.
),
merged as
(
select gid, unnest(ids1) id from encounter
union
select gid, unnest(ids2) id from encounter
)
select gid, array_agg(id) ids from merged group by gid
) r2
)
-- Now as the iterations returned every intermediate row,
-- keep only the groups whose array_length is the maximum known length for their head-of-group's ID.
-- This removes:
-- - merged groups (their head of group now appears in a group with at least 1 more element than their own group)
-- - previous iterations for the remaining groups
select * from r where (gid, array_length(ids, 1)) in
(
select unnest(ids) id, max(array_length(ids, 1)) from r group by 1
)
order by gid
;
With a result similar to the first one:
| gid | ids        |
|-----|------------|
| 1   | {1,5}      |
| 2   | {2,3,4,8}  |
| 6   | {6,7,9,10} |
| 11  | {11,12}    |
| 13  | {13}       |
Or with each group as a side column of the original incident table (conforming to the expected result of the question):
| incident_id | linked_incident_id | gid |
|-------------|--------------------|-----|
| 1           | 5                  | 1   |
| 4           | 8                  | 2   |
| 2           | 3                  | 2   |
| 8           | 2                  | 2   |
| 9           | 10                 | 6   |
| 5           | null               | 1   |
| 7           | 9                  | 6   |
| 2           | 8                  | 2   |
| 9           | 7                  | 6   |
| 6           | 10                 | 6   |
| 10          | null               | 6   |
| 12          | 11                 | 11  |
| 13          | null               | 13  |
(second query running in this fiddle)
N.B.: on my own test data, this subtle version ran in 5 iterations instead of 14 for the brute force one.
Third iteration: no array
As noted by @MatBailie, and confirmed on your real-world data and on my random data, a recursive CTE using PostgreSQL arrays as intermediate storage for groups can't process more than about 1500 links in under a minute (and run times grow exponentially beyond that).
We can improve it a bit:
- avoid the array_agg() / unnest() between iterations by relying on a simple flat recursive CTE of (id bigint, gid bigint): each ID gets stored with the lowest group ID it could be attached to on the last iteration
- do some "manual subiterations" on groups only, during each iteration over the full dataset.
If for example incident 100 appears in both groups 2 and 3, and incident 101 in groups 3 and 4, the current iteration will merge group 3 to 2 (add incidents of group 3 to group 2) thanks to 100, and group 4 to 3 thanks to 101. But then next iteration will merge the new 3 (= 3||4) to the new 2 (= 2||3), due to the 3 still in common.
So we can directly merge groups 2, 3, and 4 without having to wait for the fully merged dataset of next iteration.
Of course this is in itself recursive (we could have an infinitely long chain of pair-by-pair hopping), so we'll just hardcode a fixed number of left joins before passing on to the next conventional iteration.
(and yes, here I'm bordering on procedural programming…)
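What a subiteration does to the pending merges can be sketched in a few lines of Python (collapse_merges is a hypothetical name): each hop replays the {merged group -> target group} mapping onto itself, like chaining one more left join of the encounter CTE onto itself.

```python
def collapse_merges(merges, hops):
    # merges maps a group head to the lower group head it must merge into.
    # Each hop follows the mapping once more, like one extra left join
    # (encounter1 builds on encounter0, encounter2 on encounter1, ...).
    for _ in range(hops):
        merges = {gid: merges.get(newgid, newgid) for gid, newgid in merges.items()}
    return merges

# Chain: merge 3 into 2, 4 into 3, 5 into 4.
# Two extra hops point every group directly at 2,
# instead of spending two more full iterations on it.
collapse_merges({3: 2, 4: 3, 5: 4}, hops=2)  # -> {3: 2, 4: 2, 5: 2}
```

With a fixed hop count an overly long chain is simply left for the next full iteration to finish, so correctness is preserved either way.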
We get two big benefits from those subiterations:
- we're not working on the full groups (as already partially aggregated by previous iterations), but only looking for an opportunistic merge between the heads of groups, so we have a much smaller set to work with
(and the more iterations we pass, the more we aggregate: the groups become bigger but fewer, so our subiterations become more and more effective)
- we're not using recursive CTEs for them, so we do not store their intermediate results:
the more they merge, the more they relieve memory pressure on the whole recursive CTE,
not really thanks to smaller groups (merging two groups of 10 tickets may yield a group of, say, 17 tickets, because they had 3 tickets in common, the ones that told us they were connected),
but because by making 3 passes in 1 we skip storing 2 intermediate recursive result sets (that we want to throw away anyway, since we're only interested in the final, biggest merge)
Performance improves (on a test with n random links between fictitious nodes numbered 1 to n, with a big cluster expected around node 1 due to manual links around it in addition to the random ones):
| n    | v2   | v3.0x0 | v3.0x1 | v3.0x2  | v3.1x0  | v3.1x1 | v3.1x2 | v3.1x3 | v3.1x4 | v3.1x5 |
|------|------|--------|--------|---------|---------|--------|--------|--------|--------|--------|
| 1000 | 28   |        |        | 4 s     |         |        |        |        |        |        |
| 1500 | 2 mn |        |        | 12 s    |         |        |        |        |        |        |
| 2000 | 6 mn | 4 s    | 6 s    | 16 s    |         |        |        |        |        |        |
| 4000 |      | 15 s   | 45 s   | 2 mn 30 | 15 s    | 18 s   | 9 s    | 5 s    | 2.5 s  | 2.5 s  |
| 6000 |      |        |        |         | 40 s    | 50 s   | 26 s   | 15 s   | 8 s    | 7 s    |
| 8000 |      |        |        |         | 1 mn 20 | 2 mn   | 52 s   | 30 s   | 17 s   | 13 s   |
Legend:
- v2 is the "Second attempt" above
- v3 is this "no array" algorithm
- the xn suffix gives the number of subiterations per iteration, thus the n of the last encountern CTE (so for example the query pasted below is v3.1x2, because it ends with an encounter2)
- v3.0 is a bugged version where the encountern CTEs lacked the distinct, so the only distinct applied to the join with rr, which is the full dataset, while we wanted to go fast by deduplicating while working only with the heads of groups (the contents of the encountern CTEs :-\ )
- v3.1 fixes it. Note that v3.0x0 takes the same time as v3.1x0 (15 s), because x0 means "0 subiterations", and the bug was in the subiterations…
- the tests were run with shell for loops and a bit of SQL preprocessing, running x0, x1, x2, x3, x4, x5, x0, x1… (3 times each)
It looks like the more work is done in subiterations, the quicker we go.
But those tests are not representative of reality: they run over fully random links between random tickets.
It is possible that real data, with "natural" clusters of tickets, doesn't require as many subiterations to work optimally.
v3.1x6 was the best performer on tests with 25000 links, taking 4 to 8 mn (the random data producing one big cluster sometimes reaching 20000 incidents, no other cluster ever exceeding 12 incidents).
-- Version 3, without arrays.
with recursive
-- Have each node represented (even those without any link, so that isolated nodes get at least their own group).
n as (select incident_id id from incident union select linked_incident_id from incident where linked_incident_id is not null),
-- Starting links. Always have the lowest ID be the group ID.
-- We strictly compare incident_id to linked_incident_id:
-- - thus nulls are removed
-- - and we avoid the = case, which would be redundant with those from n
-- But still use union (instead of union all), to remove pure duplicates or reversed pairs.
l as
(
select incident_id gid, linked_incident_id id from incident where incident_id < linked_incident_id
union
select linked_incident_id gid, incident_id id from incident where incident_id > linked_incident_id
),
r as
(
select gid, id from l
union all
select gid, gid from l -- Make the gid (ID of the group's head) itself appear in the group.
union all
select gid, id from
(
with
-- We can only select once from r, but we need to self-join or group by. So put it in a CTE.
rr as (select * from r),
-- Each ID that appears in two groups merges those two groups.
-- Thus if ID 8 appears in both groups 2 and 3, we know that 3 should be merged to 2.
encounter as
(
select distinct min(gid) over (partition by id) newgid, gid
from rr
where id in (select id from rr group by id having count(1) > 1)
),
-- Now that our window function has run, we discard the "merge the lowest ID group to itself" row.
encounter0 as (select * from encounter where newgid < gid),
-- Chains will slow down our iteration:
-- if 100 is common to 2 and 3, 101 to 3 and 4, 102 to 4 and 5, we'll want to merge 3 to 2, 4 to 3, 5 to 4.
-- But then on next iteration we'll find 101 in both the new 2 (= 2||3) and the new 3 (= 3||4), so we'll want to merge them too.
-- But then on next iteration we'll find 102 etc.
-- So we could want to merge as many groups as possible in one iteration, to avoid merging them on this iteration AND on the next.
-- This is less costly here (where we only handle group heads) than in the recursive merge (where we union ALL nodes of both groups).
-- But if we have a chain of unbounded length, that would require recursion inside our recursion, which complicates a lot of things (and depending on the RDBMS may not even be accepted).
-- However we can mitigate it by introducing a few "manual iteration" steps here, handling at least one batch of encounters where the newgid itself appears as merged into a group with a lower ID.
encounter1 as
(
select distinct coalesce(e1.newgid, e0.newgid) newgid, e0.gid
from encounter0 e0 left join encounter0 e1 on e0.newgid = e1.gid
),
encounter2 as
(
select distinct coalesce(e1.newgid, e0.newgid) newgid, e0.gid
from encounter1 e0 left join encounter1 e1 on e0.newgid = e1.gid
)
-- Now merge.
select distinct newgid gid, id from encounter2 e join rr on rr.gid in (e.gid, e.newgid)
) r2
),
-- Each node now gets the lowest gid it saw during iterations as its definitive gid.
best as
(
select min(gid) gid, id from r group by id
),
-- Complete with isolated nodes.
everything as
(
select * from best
union all
select id, id from n where id not in (select id from best)
)
-- By group display:
--select gid, array_agg(id) ids from everything group by 1 order by 1;
-- By incident display:
select i.*, gid from incident i join everything g on incident_id = g.id;
And of course we get the same results as the two previous queries.
(see it as the third query in the fiddle)