SQL count using multiple join

Question

I have three tables related to each other in the following way:

host (has multiple sessions)
session (has multiple processes)
processes

The table structure is as follows:

host table - id, name
session table - id, host_id, name
process table - id, session_id, name

What I am trying to achieve is the count of the number of sessions and the count of the number of processes on each host. To achieve this I tried the following query, but the output is wrong.

select host.id, 
       count(sessions.id) as "session count", 
       count(process.id) as "process count"
from host as host
     left outer join sessions as sessions on host.id = sessions.host_id
     left outer join process as process on sessions.id = process.session_id
group by host.id;

Here's the SQLFiddle to the schema.

As per the data in the fiddle, the output should be:

id | session count | process count 
----------------------------------
1  |     2         |   3
2  |     1         |   2
3  |     1         |   2
4  |     2         |   3

But what I get is:

id | session count | process count 
----------------------------------
1  |     3         |   3
2  |     2         |   2
3  |     2         |   2
4  |     3         |   3

What can be the correct query to get the desired output?

user1529235 · Accepted Answer · 2017-07-07 14:06:28Z

6

Use distinct:

select host.id, 
       count(distinct sessions.id) as "session count", 
       count(distinct process.id) as "process count"
from host as host
     left outer join sessions as sessions on host.id = sessions.host_id
     left outer join process as process on sessions.id = process.session_id
group by host.id;
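To see the effect of distinct concretely, here is a minimal sketch that runs the question's query both ways against an in-memory SQLite database from Python. The rows are an assumption, reconstructed from the join output quoted in this thread rather than copied from the fiddle, and SQLite is only a stand-in for the actual DBMS.

```python
import sqlite3

# Assumption: sample rows reconstructed from the join output quoted in this thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key, name text);
    create table sessions (id integer primary key, host_id integer, name text);
    create table process  (id integer primary key, session_id integer, name text);
    insert into host(id) values (1), (2), (3), (4);
    insert into sessions(id, host_id) values (1,1), (2,1), (3,3), (4,4), (5,2), (6,4);
    insert into process(id, session_id) values
        (1,1), (2,1), (3,3), (4,4), (5,2), (6,4), (7,3), (8,5), (9,5), (10,6);
""")

# The same query, with or without the distinct keyword inside each count().
query = """
    select host.id, count({d} sessions.id), count({d} process.id)
    from host
         left outer join sessions on host.id = sessions.host_id
         left outer join process on sessions.id = process.session_id
    group by host.id
    order by host.id
"""
plain = conn.execute(query.format(d="")).fetchall()
distinct = conn.execute(query.format(d="distinct")).fetchall()

print(plain)     # [(1, 3, 3), (2, 2, 2), (3, 2, 2), (4, 3, 3)]
print(distinct)  # [(1, 2, 3), (2, 1, 2), (3, 1, 2), (4, 2, 3)]
```

The first list matches the wrong output from the question and the second matches the expected output.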


9 Comments

Olivier Jacot-Descombes Over a year ago

I think distinct isn't necessary for the process.

user1529235 Over a year ago

Hi Olivier. I know there are numerous ways to get the outcome but this was the easiest tweak to the existing code.

Prerak Sola Over a year ago

@OlivierJacot-Descombes the distinct keyword gives the expected output. However, could you suggest what the other ways are? I would like to know them, as they might be easier to integrate with my ORM. I am struggling to use distinct in my ORM.

Olivier Jacot-Descombes Over a year ago

The solution of @user1529235 is okay, but you can just drop the distinct for the processes, as the query yields distinct process ids anyway. It doesn't yield distinct session ids, though. Run select host.id, sessions.id as sid, process.id as pid from ... (without group by) and you will see.

theo Over a year ago

I actually think that given the schema as is using a count(distinct [field]) is probably the best option. I'll add a more complex answer below.


Olivier Jacot-Descombes · Accepted Answer · 2017-07-07 14:34:38Z

1

If you query without the group by clause, you will see that you are getting the same session id multiple times. Therefore your session count is too high.

select h.id as hid, s.id as sid, p.id as pid
from host h
left join sessions s on h.id = s.host_id
left join process p on s.id = p.session_id
order by h.id, s.id, p.id;

hid sid pid
-----------
1   1   1
1   1   2
1   2   5
2   5   8
2   5   9
3   3   3
3   3   7
4   4   4
4   4   6
4   6   10

Therefore use count(distinct s.id) for the sessions:

select h.id as hid, count(distinct s.id) as session_count, count(p.id) as process_count
from host h
left join sessions s on h.id = s.host_id
left join process p on s.id = p.session_id
group by h.id
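As a rough check of why only the session count needs distinct, the sketch below runs the ungrouped join and counts how often each id appears. It uses Python's sqlite3 module with rows that are an assumption reconstructed from the output listed above; SQLite is a stand-in for the real DBMS.

```python
import sqlite3
from collections import Counter

# Assumption: sample rows reconstructed from the join output listed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key, name text);
    create table sessions (id integer primary key, host_id integer, name text);
    create table process  (id integer primary key, session_id integer, name text);
    insert into host(id) values (1), (2), (3), (4);
    insert into sessions(id, host_id) values (1,1), (2,1), (3,3), (4,4), (5,2), (6,4);
    insert into process(id, session_id) values
        (1,1), (2,1), (3,3), (4,4), (5,2), (6,4), (7,3), (8,5), (9,5), (10,6);
""")

rows = conn.execute("""
    select h.id as hid, s.id as sid, p.id as pid
    from host h
    left join sessions s on h.id = s.host_id
    left join process p on s.id = p.session_id
""").fetchall()

# How many joined rows carry each session id and each process id?
session_repeats = Counter(sid for _, sid, _ in rows)
process_repeats = Counter(pid for _, _, pid in rows)

max_session_repeats = max(session_repeats.values())
max_process_repeats = max(process_repeats.values())
print(max_session_repeats)  # 2: a session row is repeated once per process
print(max_process_repeats)  # 1: each process id appears exactly once
```

Because process is the last table in the chain, nothing fans its rows out further, so a plain count(p.id) is already correct.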

edited Jul 7, 2017 at 14:34

answered Jul 7, 2017 at 14:29


Martin Brown · Accepted Answer · 2017-07-07 14:40:13Z

1

John Faz's answer is better. However, as you asked for other ways, it is also possible to do this with subqueries, like this:

select
  host.id,
  (select count(*)
     from sessions
    where sessions.host_id = host.id) as "session count",
  (select count(*)
     from process
     join sessions on process.session_id = sessions.id
    where sessions.host_id = host.id) as "process count"
from host;
EDIT:

Actually, I take back that bit about John Faz's answer being better. I just compared execution plans for the two: my query took 28% and John's took 50% (the other 22% was setup and teardown). I was using only the very small amount of data from the SQL Fiddle example, and with big data and different index choices things are likely to be different. However, it does show that this query may be better in some circumstances.
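As a sanity check on correctness (not performance), the following sketch compares the correlated-subquery form with the count(distinct ...) join. It uses Python's sqlite3 module; the rows are an assumption reconstructed from the join output quoted in this thread, and host 5 is a hypothetical extra host with no sessions, added to exercise the zero case.

```python
import sqlite3

# Assumption: sample rows reconstructed from the thread; host 5 is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key, name text);
    create table sessions (id integer primary key, host_id integer, name text);
    create table process  (id integer primary key, session_id integer, name text);
    insert into host(id) values (1), (2), (3), (4), (5);
    insert into sessions(id, host_id) values (1,1), (2,1), (3,3), (4,4), (5,2), (6,4);
    insert into process(id, session_id) values
        (1,1), (2,1), (3,3), (4,4), (5,2), (6,4), (7,3), (8,5), (9,5), (10,6);
""")

# One correlated subquery per count; no join fan-out, so no distinct is needed.
subquery_rows = conn.execute("""
    select host.id,
           (select count(*) from sessions
             where sessions.host_id = host.id),
           (select count(*) from process
              join sessions on process.session_id = sessions.id
             where sessions.host_id = host.id)
    from host
    order by host.id
""").fetchall()

# The join-based form with distinct, for comparison.
join_rows = conn.execute("""
    select host.id, count(distinct sessions.id), count(distinct process.id)
    from host
         left join sessions on host.id = sessions.host_id
         left join process on sessions.id = process.session_id
    group by host.id
    order by host.id
""").fetchall()

print(subquery_rows)  # [(1, 2, 3), (2, 1, 2), (3, 1, 2), (4, 2, 3), (5, 0, 0)]
print(subquery_rows == join_rows)  # True
```

Both forms agree on this data, including zeros for the host without sessions. Which one runs faster depends on the DBMS, data volume, and indexes.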

edited Jul 7, 2017 at 14:40

answered Jul 7, 2017 at 14:23


2 Comments

theo Over a year ago

I think it's a common mistake to assume that faster = better. Speed is certainly important, but it's not the only thing to consider. Maintainability and readability of code often count for a lot, and sub-queries are almost always incredibly hard to maintain.

Martin Brown Over a year ago

@theo Agreed, I wasn't suggesting this is always better, only sometimes. I was also trying to show surprise that this option is ever better. Knowing what's best would require far more knowledge about the circumstances.

Martin Brown · Accepted Answer · 2017-07-07 16:55:13Z

The real issue here is that you are working with a chain of one-to-many relationships. If there were just one relationship in the chain, a count() function would work fine with no issues. But chaining them together results in the intermediary object (session in this case) being replicated once for each matching row of the final relationship (process). This is why you are getting elevated session counts.

You could use distinct, which counts each identifier only once. The answer by John Faz is correct, but you would only really need one distinct, not two, since the final table of the relationship (process) won't be replicated.

select
    host_id = H.ID,
    session_count = count(distinct S.ID),
    process_count = count(P.ID)
    from host H
        left join sessions S on H.ID = S.host_id
        left join process as P on S.ID = P.session_id
    group by H.ID

Another option would be to perform your count in multiple stages using a CTE. I think this would be less performant, particularly if you have a larger set of data, but it accurately models the counts you're trying to do.

;with cteSessions (session_id, host_id, process_count) as (
    select
        session_id = S.ID,
        S.host_id,
        process_count = count(P.ID)  -- count(1) would also count the null-extended row of a session with no processes
        from sessions S
            left join process P on S.ID = P.session_id
        group by
            S.ID,
            S.host_id
)
select
    host_id = H.ID,
    session_count = count(S.session_id),
    process_count = sum(isnull(s.process_count, 0))
    from host H
        left join cteSessions S on H.ID = S.host_id
    group by 
        H.ID
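One detail worth checking in the staged approach is what the inner count sees for a session that has no processes: the left join still produces one null-extended row, which count(1) counts and count(P.ID) ignores. The sketch below illustrates this with Python's sqlite3 module. It is an adaptation, not the query above verbatim: SQLite needs `as` aliases and coalesce in place of the T-SQL `alias = expr` form and isnull. The rows are an assumption reconstructed from the thread, and hosts 5 and 6 and session 7 are hypothetical additions for the edge cases.

```python
import sqlite3

# Assumption: rows reconstructed from the thread. Hypothetical additions:
# host 5 has no sessions; host 6 has one session (7) with no processes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key, name text);
    create table sessions (id integer primary key, host_id integer, name text);
    create table process  (id integer primary key, session_id integer, name text);
    insert into host(id) values (1), (2), (3), (4), (5), (6);
    insert into sessions(id, host_id) values
        (1,1), (2,1), (3,3), (4,4), (5,2), (6,4), (7,6);
    insert into process(id, session_id) values
        (1,1), (2,1), (3,3), (4,4), (5,2), (6,4), (7,3), (8,5), (9,5), (10,6);
""")

def staged_counts(inner_count_expr):
    """Count processes per session first, then roll sessions up per host."""
    sql = """
        with cte_sessions as (
            select s.id as session_id, s.host_id,
                   count({expr}) as process_count
            from sessions s
                 left join process p on s.id = p.session_id
            group by s.id, s.host_id
        )
        select h.id,
               count(c.session_id),
               coalesce(sum(c.process_count), 0)
        from host h
             left join cte_sessions c on h.id = c.host_id
        group by h.id
        order by h.id
    """.format(expr=inner_count_expr)
    return conn.execute(sql).fetchall()

by_id = staged_counts("p.id")  # counts only real process rows
by_row = staged_counts("1")    # counts every joined row, including null-extended ones

print(by_id[-2:])   # [(5, 0, 0), (6, 1, 0)]
print(by_row[-2:])  # [(5, 0, 0), (6, 1, 1)]
```

On hosts 1 to 4 both variants agree; they differ only for the session without processes, where counting rows reports a phantom process.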

You could also use sub-queries, which I hate, but it would work:

select
    host_id = H.ID,
    session_count = (select count(1) from sessions s where s.host_id = H.ID),
    process_count = (select count(1) from sessions s join process p on s.id = p.session_id where s.host_id = H.ID)
    from host H

Be careful with the CTE version when there is a host that doesn't have any sessions or processes: if the outer query counts rows with count(1) and sums process_count without isnull, it incorrectly counts 1 instead of 0 for the sessions and gives null for the processes.
