SQL query uses "wrong" join

Question

I have an query which gives me the wrong result.

Tables:

A
+----+
| id |
+----+
|  1 |
|  2 |
+----+

B
+----+----+
| id |  x |  B.id = A.id
+----+----+
|  1 |  1 |
|  1 |  1 |
|  1 |  0 |
+----+----+

C
+----+----+
| id |  y |  C.id = A.id
+----+----+
|  1 |  1 |
|  1 |  2 |
+----+----+

What I want to do: Select all rows from A. For each row in A count in B all x with value 1 and all x with value 0 with B.id = A.id. For each row in A get the minimum y from C with C.id = A.id.

The result I am expecting is:

+----+------+--------+---------+
| id |  min | count1 | count 2 |
+----+------+--------+---------+
|  1 |    1 |      2 |       1 |
|  2 | NULL |      0 |       0 |
+----+------+--------+---------+

First Try: This doesn't work.

SELECT a.id,
       MIN(c.y),
       SUM(IF(b.x = 1, 1, 0)),
       SUM(IF(b.x = 0, 1, 0))
FROM   a
       LEFT JOIN b
              ON ( a.id = b.id )
       LEFT JOIN c
              ON ( a.id = c.id )
GROUP BY a.id

+----+------+--------+---------+
| id |  min | count1 | count 2 |
+----+------+--------+---------+
|  1 |    1 |      4 |       2 |
|  2 | NULL |      0 |       0 |
+----+------+--------+---------+

Second Try: This works but I am sure it has a bad performance.

SELECT a.id,
       MIN(c.y),
       b.x,
       b.y
FROM   a
       LEFT JOIN (SELECT b.id, SUM(IF(b.x = 1, 1, 0)) x, SUM(IF(b.x = 0, 1, 0)) y FROM b) b
              ON ( a.id = b.id )
       LEFT JOIN c
              ON ( a.id = c.id )
GROUP BY a.id

+----+------+--------+---------+
| id |  min | count1 | count 2 |
+----+------+--------+---------+
|  1 |    1 |      2 |       1 |
|  2 | NULL |      0 |       0 |
+----+------+--------+---------+

Last Try: This works too.

SELECT x.*,
       SUM(IF(b.x = 1, 1, 0)),
       SUM(IF(b.x = 0, 1, 0))
FROM   (SELECT a.id,
               MIN(c.y)
        FROM   a
               LEFT JOIN c
                      ON ( a.id = c.id )
        GROUP  BY a.id) x
       LEFT JOIN b
              ON ( b.id = x.id )
GROUP  BY x.id

Now my question is: Is the last one the best choise or is there a way to write this query with just one select statement (like in the first try)?

what RBDMS? MySQL, PostgreSQL?

roman
– roman

2013-08-15 14:49:11 +00:00
Commented Aug 15, 2013 at 14:49 — roman
– roman, Commented Aug 15, 2013 at 14:49
I use MySQL as database

KN4CK3R
– KN4CK3R

2013-08-15 16:05:15 +00:00
Commented Aug 15, 2013 at 16:05 — KN4CK3R
– KN4CK3R, Commented Aug 15, 2013 at 16:05

Gordon Linoff · Accepted Answer · 2013-08-15 22:10:20Z

3

Your joins are doing cartesian products for a given value, because there are multiple rows in each table.

You can fix this by using count(distinct) rather than sum():

SELECT a.id, MIN(c.y),
       count(distinct (case when b.x = 1 then b.id end)),
       count(distinct (case when b.x = 0 then b.id end))
FROM   a
       LEFT JOIN b
              ON ( a.id = b.id )
       LEFT JOIN c
              ON ( a.id = c.id )
GROUP BY a.id;

You can also fix this by pre-aggregating b (and/or c). And you would need to take that approach if your aggregation function were something like the sum of a column in b.

EDIT:

You are correct. The above query counts the distinct values of B, but B contains rows that are exact duplicates. (Personally, I think having a column with the name id that has duplicates is a sign of poor design, but that is another issue.)

You could solve it by having a real id in the b table, because then the count(distinct) would count the correct values. You can also solve it by aggregating the two tables before joining them in:

SELECT a.id, c.y, x1, x0
FROM   a
       LEFT JOIN (select b.id,
                         sum(b.x = 1) as x1,
                         sum(b.x = 0) as x0
                  from b
                  group by b.id
                 ) b
              ON ( a.id = b.id )
       LEFT JOIN (select c.id, min(c.y) as y
                  from c
                  group by c.id
                 ) c
              ON ( a.id = c.id );

Here is a SQL Fiddle for the problem.

EDIT II:

You can get it in one statement, but I'm not so sure that it would work on similar data. The idea is that you can count all the cases where x = 1 and then divide by the number of rows in the C table to get the real distinct count:

SELECT a.id, MIN(c.y), 
       coalesce(sum(b.x = 1), 0) / count(distinct coalesce(c.y, -1)), 
       coalesce(sum(b.x = 0), 0) / count(distinct coalesce(c.y, -1))
FROM   a
       LEFT JOIN b
              ON ( a.id = b.id )
       LEFT JOIN c
              ON ( a.id = c.id )
GROUP BY a.id;

It is a little tricky, because you have to handle NULLs to get the right values. Note that this is counting the y value to get a distinct count from the C table. Your question re-enforces why it is a good idea to have a unique integer primary key in every table.

edited Aug 15, 2013 at 22:10

answered Aug 15, 2013 at 14:49

Gordon Linoff

1.3m62 gold badges706 silver badges857 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

KN4CK3R Over a year ago

Thank you, but this query doesn't give me the correct result. It returns 1, 1, 1, 1 instead of 1, 1, 2, 1.

KN4CK3R Over a year ago

Thank you again. Your second edit is like my second try and this will not have a good performance if there are lots of entries in B because of many unneeded sums. And your last edit works but it is a hack I don't want to use just to get it in one query. I will stay with my last try version.

Collectives™ on Stack Overflow

SQL query uses "wrong" join

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related