2

I have three tables related to each other in the following way:

  1. host (has multiple sessions)
  2. session (has multiple processes)
  3. processes

The table structure is as follows:

  1. host table - id, name
  2. sessions table - id, host_id, name
  3. process table - id, session_id, name

I want the number of sessions and the number of processes on each host. I tried the following query, but the output is wrong.

select host.id, 
       count(sessions.id) as "session count", 
       count(process.id) as "process count"
from host as host
     left outer join sessions as sessions on host.id = sessions.host_id
     left outer join process as process on sessions.id = process.session_id
group by host.id;

Here's the SQLFiddle to the schema.

As per the data in the fiddle, the output should be:

id | session count | process count 
----------------------------------
1  |     2         |   3
2  |     1         |   2
3  |     1         |   2
4  |     2         |   3

But what I get is:

id | session count | process count 
----------------------------------
1  |     3         |   3
2  |     2         |   2
3  |     2         |   2
4  |     3         |   3

What is the correct query to get the desired output?

4 Answers

6

Use distinct:

select host.id, 
       count(distinct sessions.id) as "session count", 
       count(distinct process.id) as "process count"
from host as host
     left outer join sessions as sessions on host.id = sessions.host_id
     left outer join process as process on sessions.id = process.session_id
group by host.id;

9 Comments

I think distinct isn't necessary for the process.
Hi Olivier. I know there are numerous ways to get the outcome but this was the easiest tweak to the existing code.
@OlivierJacot-Descombes the distinct keyword gives the expected output. However, could you suggest what the other ways are? I would like to know them. They might be better to integrate with my ORM. I am struggling to integrate the use of distinct in my ORM.
The solution of @user1529235 is okay, but you can just drop the distinct for the processes, as the query yields distinct process ids anyway. It doesn't yield distinct session ids, though. Run select host.id, sessions.id as sid, process.id as pid from ... (without group by) and you will see.
I actually think that given the schema as is using a count(distinct [field]) is probably the best option. I'll add a more complex answer below.
1

If you query without the group by clause, you will see that you are getting the same session id multiple times. Therefore your session count is too high.

select h.id as hid, s.id as sid, p.id as pid
from host h
left join sessions s on h.id = s.host_id
left join process p on s.id = p.session_id
order by h.id, s.id, p.id;

hid sid pid
-----------
1   1   1
1   1   2
1   2   5
2   5   8
2   5   9
3   3   3
3   3   7
4   4   4
4   4   6
4   6   10

Therefore use count(distinct s.id) for the sessions:

select h.id as hid, count(distinct s.id) as session_count, count(p.id) as process_count
from host h
left join sessions s on h.id = s.host_id
left join process p on s.id = p.session_id
group by h.id
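To check this without the fiddle, here is a minimal sketch that runs the query on SQLite through Python's sqlite3 module. Both are my assumptions; the question does not say which database is used. The sample rows are reconstructed from the joined output listed above, not copied from the fiddle. The extra naive_sessions column shows what a plain count(s.id) returns on the same rows.

```python
import sqlite3

# In-memory SQLite database (an assumption; the original database is not named).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key, name text);
    create table sessions (id integer primary key, host_id integer, name text);
    create table process  (id integer primary key, session_id integer, name text);

    -- Rows reconstructed from the hid/sid/pid listing above.
    insert into host (id) values (1), (2), (3), (4);
    insert into sessions (id, host_id) values
        (1, 1), (2, 1), (3, 3), (4, 4), (5, 2), (6, 4);
    insert into process (id, session_id) values
        (1, 1), (2, 1), (3, 3), (4, 4), (5, 2),
        (6, 4), (7, 3), (8, 5), (9, 5), (10, 6);
""")

query = """
    select h.id,
           count(s.id)          as naive_sessions,  -- one per joined row
           count(distinct s.id) as session_count,   -- one per session
           count(p.id)          as process_count    -- process ids are already unique
    from host h
    left join sessions s on h.id = s.host_id
    left join process p on s.id = p.session_id
    group by h.id
    order by h.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The naive_sessions column always equals the number of joined rows for the host, which is why it matched the process count in the question.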


1

John Faz's answer is better. However, as you asked for other ways, it is also possible to do this with subqueries, like this:

select
  host.id,
  (select count(*) from sessions where host_id = host.id) as "session count",
  (select count(*) from process join sessions on process.session_id = sessions.id where sessions.host_id = host.id)  as "process count"
 from
   host
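As a rough sanity check, the sketch below compares the subquery version with the count(distinct) version on a small made-up data set that includes a host without sessions and a session without processes. It assumes SQLite through Python's sqlite3 module, and the rows are hypothetical, not the fiddle data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key);
    create table sessions (id integer primary key, host_id integer);
    create table process  (id integer primary key, session_id integer);

    -- Hypothetical data: host 3 has no sessions, session 3 has no processes.
    insert into host values (1), (2), (3);
    insert into sessions values (1, 1), (2, 1), (3, 2);
    insert into process values (1, 1), (2, 1), (3, 2);
""")

# Correlated subqueries: each count is computed independently per host.
subquery_sql = """
    select h.id,
           (select count(*) from sessions s where s.host_id = h.id),
           (select count(*)
              from process p
              join sessions s on p.session_id = s.id
             where s.host_id = h.id)
    from host h
    order by h.id
"""

# Join version for comparison: distinct is needed only for the sessions.
join_sql = """
    select h.id, count(distinct s.id), count(p.id)
    from host h
    left join sessions s on h.id = s.host_id
    left join process p on s.id = p.session_id
    group by h.id
    order by h.id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows)
print(subquery_rows == join_rows)
```

On this data both queries agree, including zero counts for the empty host. That says nothing about relative performance, which depends on the database and the indexes.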

EDIT:

Actually, I take back that bit about John Faz's answer being better. I just ran an execution plan over the two, and my query took 28% while John's took 50% (the remaining 22% was setup and teardown). I was using only the very small amount of data from the SQL Fiddle example; with big data and different index choices, things are likely to be different. However, it does show that this query may be better in some circumstances.

2 Comments

I think it's a common mistake to assume that faster = better. Speed is certainly important, but it's not the only important thing to consider. Maintainability and readability of code often count for a lot, and sub-queries are almost always incredibly hard to maintain.
@theo Agreed, I wasn't suggesting this is always better, only sometimes. Also, I was trying to show surprise that this option is ever better. To know what's best would require way more knowledge about the circumstances.
0

The real issue here is that you are working with a chain of one-to-many relationships. If there were just one relationship in the chain, a count() function would work fine. But chaining them together means each row of the intermediary table (sessions in this case) is repeated once per matching row of the final table (process). This is why you are getting elevated session counts.

You could use distinct, which counts each identifier only once. The answer by John Faz is correct, but you would only really need one distinct, not two, since the final table of the relationship (process) won't be replicated.

select
    host_id = H.ID,
    session_count = count(distinct S.ID),
    process_count = count(P.ID)
    from host H
        left join sessions S on H.ID = S.host_id
        left join process as P on S.ID = P.session_id
    group by H.ID

Another option would be to perform your count in multiple stages using a CTE. I think this would be less performant, particularly if you have a larger set of data, but it accurately models the counts you're trying to do.

;with cteSessions (session_id, host_id, process_count) as (
    select
        session_id = S.ID,
        S.host_id,
        -- count(P.ID) rather than count(1): a session with no processes still
        -- produces one null-padded row from the left join, which count(1) would count
        process_count = count(P.ID)
        from sessions S
            left join process P on S.ID = P.session_id
        group by
            S.ID,
            S.host_id
)
select
    host_id = H.ID,
    session_count = count(S.session_id),
    process_count = sum(isnull(S.process_count, 0))
    from host H
        left join cteSessions S on H.ID = S.host_id
    group by 
        H.ID
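With the staged approach, the aggregate used in the per-session step matters: count(1) counts the null-padded row that a left join emits for a session without processes, while count(p.id) skips it. Here is a minimal sketch comparing the two, assuming SQLite through Python's sqlite3 module (so coalesce stands in for isnull). The data, the session_totals name, and the host_counts helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table host     (id integer primary key);
    create table sessions (id integer primary key, host_id integer);
    create table process  (id integer primary key, session_id integer);

    -- Hypothetical data: host 3 has no sessions, session 3 has no processes.
    insert into host values (1), (2), (3);
    insert into sessions values (1, 1), (2, 1), (3, 2);
    insert into process values (1, 1), (2, 1), (3, 2);
""")

def host_counts(per_session_aggregate):
    """Count processes per session first, then roll the totals up per host."""
    sql = f"""
        with session_totals as (
            select s.id as session_id,
                   s.host_id,
                   {per_session_aggregate} as process_count
            from sessions s
            left join process p on s.id = p.session_id
            group by s.id, s.host_id
        )
        select h.id,
               count(t.session_id),
               coalesce(sum(t.process_count), 0)
        from host h
        left join session_totals t on h.id = t.host_id
        group by h.id
        order by h.id
    """
    return conn.execute(sql).fetchall()

correct = host_counts("count(p.id)")  # ignores the null-padded row
inflated = host_counts("count(1)")    # counts it as one process
print(correct)
print(inflated)
```

The two results differ only for host 2, whose single session has no processes: count(1) reports one process there, count(p.id) reports zero.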

You could also use sub-queries, which I hate, but it would work:

select
    host_id = H.ID,
    session_count = (select count(1) from sessions s where s.host_id = H.ID),
    process_count = (select count(1) from sessions s join process p on s.id = p.session_id where s.host_id = H.ID)
    from host H

1 Comment

The CTE version given here returns a wrong result if there is a host that doesn't have any sessions or processes. In that case, it incorrectly counts 1 instead of 0 for the sessions and gives null for the processes.
