Within my PostgreSQL database, I have an id column that identifies each unique lead that comes in. I also have a connected_lead_id column which shows whether accounts are related to each other (i.e., husband and wife, parents and children, group of friends, group of investors, etc.).

When we count the number of ids created during a time period, we want to see the number of unique "groups" of connected_ids during a period. In other words, we wouldn't want to count both the husband and wife pair, we would only want to count one since they are truly one lead.

We want to be able to create a view that only has the "first" id based on the "created_at" date and then contains additional columns at the end for "connected_lead_id_1", "connected_lead_id_2", "connected_lead_id_3", etc.

We want to add in additional logic so that we take the "first" id's source, unless that is null, then take the "second" connected_lead_id's source unless that is null and so on. Finally, we want to take the earliest on_boarded_date from the connected_lead_id group.

id   | created_at    | connected_lead_id | on_boarded_date | source   |
2    | 9/24/15 23:00 | 8                 |                 |          |
4    | 9/25/15 23:00 | 7                 |                 | event    |
7    | 9/26/15 23:00 | 4                 |                 |          |
8    | 9/26/15 23:00 | 2                 |                 | referral |
11   | 9/26/15 23:00 | 336               | 7/1/17          | online   |
142  | 4/27/16 23:00 | 336               |                 |          |
336  | 7/4/16 23:00  | 11                | 9/20/18         | referral |

End Goal:

id   | created_at    | on_boarded_date | source   |
2    | 9/24/15 23:00 |                 | referral |
4    | 9/25/15 23:00 |                 | event    |
11   | 9/26/15 23:00 | 7/1/17          | online   |

Ideally, we would also have i number of extra columns at the end to show each connected_lead_id that is attached to the base id.

Thanks for the help!

  • Which DBMS? Oracle, SQL Server, MySQL, PostgreSQL, etc. Commented Dec 4, 2018 at 18:09
  • PostgreSQL, apologies. I will add this in the summary. Commented Dec 4, 2018 at 18:22
  • Should be added as a tag Commented Dec 4, 2018 at 18:22
  • The main problem for me is this: for a recursion you need a starting point (as in a tree structure), but you have circles. Where is the starting point of the circle? Which should be my starting row: id = 2 or id = 8? Commented Dec 4, 2018 at 21:06
  • I believe you'll need a function: LOOP through the rows, save the passed ids. If any new id is found, print out. Commented Dec 4, 2018 at 21:08

2 Answers


OK, the best I can come up with at the moment is to first build maximal groups of related IDs, and then join back to your table of leads to get the rest of the data (see this SQL Fiddle for the setup, full queries and results).
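For anyone who wants to follow along without the fiddle, here is a minimal setup sketch built from the sample rows in the question. The table name leads matches the queries below; the column types are my assumptions, since the question does not state them.

```sql
-- Hypothetical setup: column types are assumed, not taken from the question.
create table leads (
    id                integer primary key,
    created_at        timestamp not null,
    connected_lead_id integer,      -- NULL when a lead has no connection
    on_boarded_date   date,
    source            text
);

-- Sample rows from the question, with dates rewritten in ISO format.
insert into leads (id, created_at, connected_lead_id, on_boarded_date, source) values
    (2,   '2015-09-24 23:00', 8,   null,         null),
    (4,   '2015-09-25 23:00', 7,   null,         'event'),
    (7,   '2015-09-26 23:00', 4,   null,         null),
    (8,   '2015-09-26 23:00', 2,   null,         'referral'),
    (11,  '2015-09-26 23:00', 336, '2017-07-01', 'online'),
    (142, '2016-04-27 23:00', 336, null,         null),
    (336, '2016-07-04 23:00', 11,  '2018-09-20', 'referral');
```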

To get the maximal groups you can use a recursive common table expression to first grow the groups, followed by a query to filter the CTE results down to just the maximal groups:

with recursive cte(grp) as (
select case when l.connected_lead_id is null then array[l.id] 
            else array[l.id, l.connected_lead_id]
       end      from leads l
union all
select grp || l.id
  from leads l
  join cte
    on l.connected_lead_id = any(grp)
   and not l.id = any(grp)
)
select * from cte c1

The CTE above outputs several similar groups as well as intermediate groups. The query predicate below prunes the non-maximal groups and limits the results to just one permutation of each possible group:

 where not exists (select 1 from cte c2
                   where c1.grp && c2.grp
                     and ((not c1.grp @> c2.grp)
                       or (c2.grp < c1.grp
                      and c1.grp @> c2.grp
                      and c1.grp <@ c2.grp)));
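
If the array operators in that predicate are unfamiliar, this small standalone sketch shows what each one returns. The literal arrays and the column aliases are made up for illustration only.

```sql
-- Standalone illustration; the arrays are made-up examples.
select array[2,8] && array[8,2]           as is_overlap    -- true: at least one shared element
     , array[11,336,142] @> array[11,336] as is_superset   -- true: left holds every element of right
     , array[11,336] <@ array[11,336,142] as is_subset     -- true: mirror image of @>
     , array[2,8] < array[8,2]            as sorts_first;  -- true: arrays compare element by element
```

So a group is dropped when an overlapping group contains something it lacks, or when another permutation of the same set of IDs sorts before it.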

Results:

|        grp |
|------------|
|        2,8 |
|        4,7 |
|         14 |
| 11,336,142 |
|      12,13 |

Next join the final query above back to your leads table and use window functions to get the remaining column values, along with the distinct operator to prune it down to the final result set:

with recursive cte(grp) as (
...
)
select distinct 
       first_value(l.id) over (partition by grp order by l.created_at) id
     , first_value(l.created_at) over (partition by grp order by l.created_at) create_at
     , min(l.on_boarded_date) over (partition by grp) on_boarded_date -- earliest date in the group; min() skips NULLs
     , first_value(l.source) over (partition by grp 
                                   order by case when l.source is null then 2 else 1 end
                                   , l.created_at) source
     , grp CONNECTED_IDS
  from cte c1
  join leads l
    on l.id = any(grp)
 where not exists (select 1 from cte c2
                   where c1.grp && c2.grp
                     and ((not c1.grp @> c2.grp)
                       or (c2.grp < c1.grp
                      and c1.grp @> c2.grp
                      and c1.grp <@ c2.grp)));

Results:

| id |            create_at | on_boarded_date |   source | connected_ids |
|----|----------------------|-----------------|----------|---------------|
|  2 | 2015-09-24T23:00:00Z |          (null) | referral |           2,8 |
|  4 | 2015-09-25T23:00:00Z |          (null) |    event |           4,7 |
| 11 | 2015-09-26T23:00:00Z |      2017-07-01 |   online |    11,336,142 |
| 12 | 2015-09-26T23:00:00Z |      2017-07-01 |    event |         12,13 |
| 14 | 2015-09-26T23:00:00Z |          (null) |   (null) |            14 |
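
For the extra connected_lead_id_1, connected_lead_id_2, ... columns asked for in the question, one possible sketch is to remove the base id from the array and subscript what is left. Here lead_groups is a hypothetical stand-in for the result set above, and the number of extra columns has to be fixed in advance because a plain query cannot return a variable number of columns.

```sql
-- lead_groups is a hypothetical stand-in for the grouped result above.
with lead_groups(id, connected_ids) as (
    values (2,  array[2,8])
         , (4,  array[4,7])
         , (11, array[11,336,142])
)
select id
     -- array_remove drops the base id; a subscript past the end yields NULL
     , (array_remove(connected_ids, id))[1] as connected_lead_id_1
     , (array_remove(connected_ids, id))[2] as connected_lead_id_2
  from lead_groups
 order by id;
```

With these rows, id 11 gets 336 and 142, while ids 2 and 4 get one connected id each and NULL in the second column.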

3 Comments

Hi Sentinel, this is an awesome answer! And it allows me to complete it within my read permissions. Are we able to add onto this code so it also pulls in any ids that do not have a connected_lead_id and no other ids claim them as a connected_lead_id? In other words, if we had an id = 12 that had connected_lead_id = null, could we pull that in to the results as well?
Absolutely. I've updated the above answer to take NULL-valued connected lead IDs into account using a case expression to create the initial array, and I've updated the referenced SQL Fiddle to demonstrate. I added three records: 12, 13 and 14. Both 12 and 14 have no connected leads, but 13 connects to 12. 12 and 13 get grouped even though 12 doesn't know it's connected to 13, and 14 is in a group all by its lonesome.
Phenomenal! Works perfectly. Thanks again for the help. My goal is to go over this piece by piece today and hopefully better understand what is going on. Cheers!

demo:db<>fiddle

Main idea - sketch:

  1. Looping through the ordered set. Get all ids that haven't been seen before in any connected_lead_id (cli). These are your starting points for the recursion. The problem is your number 142, which hasn't been seen before but is in the same group as 11 because of its cli. So it is better to get the clis of the unseen ids. With these values it is much simpler to calculate the ids of the groups later in the recursion part. Because of the loop, a function/stored procedure is necessary.

  2. The recursion part: First step is to get the ids of the starting clis. Calculating the first referring id by using the created_at timestamp. After that a simple tree recursion over the clis can be done.

1. The function:

CREATE OR REPLACE FUNCTION filter_groups() RETURNS int[] AS $$
DECLARE
    _seen_values int[];
    _new_values int[];
    _temprow record;
BEGIN
    FOR _temprow IN
        -- 1:
        SELECT array_agg(id ORDER BY created_at) as ids, connected_lead_id FROM groups GROUP BY connected_lead_id ORDER BY MIN(created_at)
    LOOP
        -- 2:
        IF array_length(_seen_values, 1) IS NULL 
            OR (_temprow.ids || _temprow.connected_lead_id) && _seen_values = FALSE THEN

            _new_values := _new_values || _temprow.connected_lead_id;
        END IF;

        _seen_values := _seen_values || _temprow.ids;
        _seen_values := _seen_values || _temprow.connected_lead_id;
    END LOOP;

    RETURN _new_values;
END;
$$ LANGUAGE plpgsql;
  1. Grouping all ids that refer to the same cli
  2. Loop through the id arrays. If no element of the array was seen before, add the referred cli to the output variable (_new_values). In both cases, add the ids and the cli to the variable that stores all ids seen so far (_seen_values).
  3. Return the clis.

The result so far is {8, 7, 336} (which is equivalent to the ids {2,4,11,142}!)
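
To see what the loop iterates over, query 1 from the function can be run on its own. This is only a sketch: the inline groups CTE stands in for the real table and carries just the three columns that query needs, with the rows copied from the question.

```sql
-- Inline stand-in for the groups table (sample rows from the question).
with groups(id, created_at, connected_lead_id) as (
    values (2,   timestamp '2015-09-24 23:00', 8)
         , (4,   timestamp '2015-09-25 23:00', 7)
         , (7,   timestamp '2015-09-26 23:00', 4)
         , (8,   timestamp '2015-09-26 23:00', 2)
         , (11,  timestamp '2015-09-26 23:00', 336)
         , (142, timestamp '2016-04-27 23:00', 336)
         , (336, timestamp '2016-07-04 23:00', 11)
)
-- One row per cli; 11 and 142 share cli 336, so their ids arrive as {11,142}.
select array_agg(id order by created_at) as ids, connected_lead_id
  from groups
 group by connected_lead_id
 order by min(created_at);
```

The rows for cli 8 and cli 7 come first, so both are new. The rows for cli 4 and cli 2 overlap with ids already seen, cli 336 is new, and cli 11 overlaps again, which is how the function arrives at {8, 7, 336}.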

2. The recursion:

-- 1:
WITH RECURSIVE start_points AS (
    SELECT unnest(filter_groups()) as ids
),
filtered_groups AS (
    -- 3:
    SELECT DISTINCT
       1 as depth, -- 3
       first_value(id) OVER w as id, -- 4
       ARRAY[(MIN(id) OVER w)] as visited, -- 5
       MIN(created_at) OVER w as created_at,
       connected_lead_id,
       MIN(on_boarded_date) OVER w as on_boarded_date, -- 6
       first_value(source) OVER w as source
    FROM groups 
    WHERE connected_lead_id IN (SELECT ids FROM start_points)
    -- 2:
    WINDOW w AS (PARTITION BY connected_lead_id ORDER BY created_at)

    UNION

    SELECT
        fg.depth + 1,
        fg.id,
        array_append(fg.visited, g.id), -- 8
        LEAST(fg.created_at, g.created_at), 
        g.connected_lead_id, 
        LEAST(fg.on_boarded_date, g.on_boarded_date), -- 9
        COALESCE(fg.source, g.source) -- 10
    FROM groups g
    JOIN filtered_groups fg
    -- 7
    ON fg.connected_lead_id = g.id AND NOT (g.id = ANY(visited))

)
SELECT DISTINCT ON (id) -- 11
    id, created_at,on_boarded_date, source 
FROM filtered_groups 
ORDER BY id, depth DESC;
  1. The WITH part returns the results of the function. unnest() expands the id array into one row per id.
  2. Creating a window: the window groups all values by their clis and orders each window by the created_at timestamp. In your example every value is in its own window except 11 and 142, which are grouped together.
  3. This is a helper variable used to get the latest rows later on.
  4. first_value() gives the first value of the ordered window frame. If 142 had a smaller created_at timestamp, the result would have been 142; in your data it is 11.
  5. A variable is needed to store which ids have already been visited. Without this information an infinite loop would be created: 2-8-2-8-2-8-2-8-...
  6. The minimum date of the window is taken (same thing here: if 142 had an earlier date than 11, that date would be the result).

Now the starting query of the recursion is calculated. The following describes the recursion part:

  7. Joining the table (the original function results) against the previous recursion result. The second condition stops the infinite loop I mentioned above.
  8. Appending the currently visited id to the visited variable.
  9. If the current on_boarded_date is earlier, it is taken.
  10. COALESCE gives the first NOT NULL value, so the first NOT NULL source is kept throughout the whole recursion.
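
The loop guard from points 5 and 7 can be tried in isolation. This is a minimal sketch, not part of the solution itself: the edges and walk names are made up, and the only data is the 2 <-> 8 pair from the question.

```sql
-- Hypothetical two-row cycle: 2 points to 8 and 8 points back to 2.
with recursive edges(id, connected_lead_id) as (
    values (2, 8), (8, 2)
),
walk(id, connected_lead_id, visited) as (
    -- start at id 2 and remember it as visited
    select id, connected_lead_id, array[id]
      from edges
     where id = 2
    union all
    -- follow the cli, but never step onto an id that was already visited
    select e.id, e.connected_lead_id, w.visited || e.id
      from edges e
      join walk w
        on w.connected_lead_id = e.id
       and not (e.id = any(w.visited))
)
select * from walk;
```

The walk produces two rows (visited = {2}, then {2,8}) and then stops, because the next candidate, id 2, is already in visited. Without the second join condition the query would never terminate.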

After the recursion which gives a result of all recursion steps we want to filter out only the deepest visits of every starting id.

  11. DISTINCT ON (id) returns the row with the first occurrence of each id. To get the last one, the whole set is ordered descending by the depth variable.
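
A small sketch of that last step, using made-up rows in a hypothetical steps CTE rather than the real recursion output:

```sql
-- Made-up recursion output: two depth levels per starting id.
with steps(id, depth, source) as (
    values (11, 1, null::text)
         , (11, 2, 'online')
         , (2,  1, null)
         , (2,  2, 'referral')
)
-- For each id, keep only the first row in the given order, i.e. the deepest one.
select distinct on (id) id, depth, source
  from steps
 order by id, depth desc;
```

This returns one row per id, each with depth 2. Note that the ORDER BY must start with the DISTINCT ON expression, so the sort by depth can only come second.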

2 Comments

This looks like it checks out. I'm passing it along to someone with the ability to create functions to make sure it checks out on our full data set. I will update as soon as I have confirmation that it is working. Thank you so much for putting the thought into this. The code and logic is mind-blowing.
This code also works and is a great answer! The only reason I selected the other as the answer is because I could implement it with my read-only access. But this also functioned great when I had someone with other permissions run everything. Thank you again for the excellent help!
