We've got a relatively straightforward query that does LEFT JOINs across 4 tables. A is the "main" table, the top-most table in the hierarchy. B links to A, and C links to B. In addition, X links to A. So the hierarchy is basically:

A
C => B => A
X => A

The query is essentially:

SELECT
    a.*, b.*, c.*, x.*
FROM
    a
    LEFT JOIN b ON b.a_id = a.id
    LEFT JOIN c ON c.b_id = b.id
    LEFT JOIN x ON x.a_id = a.id
WHERE
    b.flag = true
ORDER BY
    x.date DESC
LIMIT 25

Via EXPLAIN, I've confirmed that the correct indexes are in place and that the MySQL query optimizer is using them.

So here's the strange part...

When we run the query as is, it takes about 1.1 seconds to run.

However, after some checking, it seems that if I remove most of the SELECT fields, I get a significant speed boost.

So if instead we made this into a two-step query process:

  1. First query: same as above, except the SELECT clause selects only a.id instead of SELECT *
  2. Second query: also the same as above, except the WHERE clause is replaced by an a.id IN (...) against the result of Query 1

The result is drastically different: 0.03 seconds for the first query and 0.02 seconds for the second.

Doing this two-step query in code gives us roughly a 20x boost in performance.
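
To make the two steps concrete, here is a minimal sketch of how they might look in application code. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, so it says nothing about timings. The column layout, the sample rows, and the helper names (make_demo_db, fetch_ids, fetch_rows) are assumptions made up for illustration. Following the description above, the second query replaces the original WHERE clause with the IN list, which means the b.flag filter is not re-applied in step 2.

```python
import sqlite3

# Shared FROM/JOIN part of both queries (same joins as the original query).
JOINS = """
    FROM a
        LEFT JOIN b ON b.a_id = a.id
        LEFT JOIN c ON c.b_id = b.id
        LEFT JOIN x ON x.a_id = a.id
"""

def make_demo_db():
    # Hypothetical schema and rows, just enough to run the two queries.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE a (id INTEGER PRIMARY KEY);
        CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, flag INTEGER);
        CREATE TABLE c (id INTEGER PRIMARY KEY, b_id INTEGER);
        CREATE TABLE x (id INTEGER PRIMARY KEY, a_id INTEGER, date TEXT);
        INSERT INTO a VALUES (1), (2), (3);
        INSERT INTO b VALUES (1, 1, 1), (2, 2, 0), (3, 3, 1);
        INSERT INTO c VALUES (1, 1), (2, 3);
        INSERT INTO x VALUES (1, 1, '2012-08-01'), (2, 3, '2012-08-20'),
                             (3, 2, '2012-08-10');
    """)
    return conn

def fetch_ids(conn, limit=25):
    # Query 1: original joins, filter, and ordering, but only a.id is selected.
    # flag is stored as 0/1 here, so the filter is written as b.flag = 1.
    sql = "SELECT a.id" + JOINS + "WHERE b.flag = 1 ORDER BY x.date DESC LIMIT ?"
    return [row[0] for row in conn.execute(sql, (limit,))]

def fetch_rows(conn, ids, limit=25):
    # Query 2: full column list, restricted to the ids found by query 1.
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT a.*, b.*, c.*, x.*" + JOINS
        + f"WHERE a.id IN ({placeholders}) ORDER BY x.date DESC LIMIT ?"
    )
    return conn.execute(sql, (*ids, limit)).fetchall()

conn = make_demo_db()
ids = fetch_ids(conn)
rows = fetch_rows(conn, ids)
```

With this sample data both steps cover the same two `a` rows, but in general the two-step result only matches the single query if dropping the b.flag filter in step 2 cannot pull in extra joined rows.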

So here's my question:

Shouldn't this type of optimization already be done within the DB engine? Why does the choice of SELECTed fields make a difference to the overall performance of the query?

At the end of the day, it's merely selecting the exact same 25 rows and returning the exact same full contents of those 25 rows. So, why the wide disparity in performance?

ADDED 2012-08-24 13:02 PM PDT

Thanks eggyal and invertedSpear for the feedback. First off, it's not a caching issue -- I've run tests running both queries multiple times (about 10 times) alternating between each approach. The result averages at 1.1 seconds for the first (single query) approach and .03+.02 seconds for the second (2 query) approach.

In terms of indexes, I thought I had done an EXPLAIN to ensure that we're going through the keys, and for the most part we are. However, I just did a quick check again, and there is one interesting thing to note:

The slower "single query" approach doesn't show the Extra note of "Using index" for the third line:

+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type   | possible_keys          | key               | key_len | ref                           | rows | Extra                                        |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | t1    | index  | PRIMARY                | shop_group_id_idx | 5       | NULL                          |  102 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | t2    | eq_ref | PRIMARY                | PRIMARY           | 4       | dbmodl_v18.t1.organization_id |    1 | Using where                                  |
|  1 | SIMPLE      | t0    | ref    | bundle_idx,shop_id_idx | shop_id_idx       | 4       | dbmodl_v18.t1.organization_id |  309 |                                              |
|  1 | SIMPLE      | t3    | eq_ref | PRIMARY                | PRIMARY           | 4       | dbmodl_v18.t0.id              |    1 |                                              |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+

While it does show "Using index" for when we query for just the IDs:

+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type   | possible_keys          | key               | key_len | ref                           | rows | Extra                                        |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | t1    | index  | PRIMARY                | shop_group_id_idx | 5       | NULL                          |  102 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | t2    | eq_ref | PRIMARY                | PRIMARY           | 4       | dbmodl_v18.t1.organization_id |    1 | Using where                                  |
|  1 | SIMPLE      | t0    | ref    | bundle_idx,shop_id_idx | shop_id_idx       | 4       | dbmodl_v18.t1.organization_id |  309 | Using index                                  |
|  1 | SIMPLE      | t3    | eq_ref | PRIMARY                | PRIMARY           | 4       | dbmodl_v18.t0.id              |    1 |                                              |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+

The strange thing is that both do list the correct index being used... but I guess it raises the questions:

Why are they different (considering all the other clauses are the exact same)? And is this an indication of why it's slower?

Unfortunately, the MySQL docs do not give much information for when the "Extra" column is blank/null in the EXPLAIN results.

  • If the columns can be fetched from the indexes without doing a lookup into the actual record (EXPLAIN shows Using index), you may see a significant performance boost. Is this what's happening? Commented Aug 24, 2012 at 18:48
  • Is it truly faster, or are the tables just cached in memory now? Running FLUSH TABLES and RESET QUERY CACHE between tests will give you truer benchmarks. Commented Aug 24, 2012 at 19:03

1 Answer

More important than speed, you have a flaw in your query logic. When you test a LEFT JOINed column in the WHERE clause (other than testing for NULL), you force that join to behave as if it were an INNER JOIN. Instead, you'd want:

SELECT
    a.*, b.*, c.*, x.*
FROM
    a
    LEFT JOIN b ON b.a_id = a.id
        AND b.flag = true
    LEFT JOIN c ON c.b_id = b.id
    LEFT JOIN x ON x.a_id = a.id
ORDER BY
    x.date DESC
LIMIT 25
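
The difference between filtering in WHERE and filtering in ON can be checked on a tiny data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (the join semantics shown here are standard SQL); the two-table schema and the sample rows are assumptions made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, flag INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    -- a=1 has a flagged b row, a=2 has only an unflagged one, a=3 has none.
    INSERT INTO b VALUES (10, 1, 1), (20, 2, 0);
""")

# Filter in WHERE: rows of a without a flagged b are discarded,
# so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute("""
    SELECT a.id, b.id FROM a
        LEFT JOIN b ON b.a_id = a.id
    WHERE b.flag = 1
    ORDER BY a.id
""").fetchall()

# Filter in ON: every row of a is kept; b columns are NULL when no
# flagged b row matches.
on_rows = conn.execute("""
    SELECT a.id, b.id FROM a
        LEFT JOIN b ON b.a_id = a.id AND b.flag = 1
    ORDER BY a.id
""").fetchall()

print(where_rows)  # [(1, 10)]
print(on_rows)     # [(1, 10), (2, None), (3, None)]
```

Which of the two results is the right one depends on whether rows of a without a flagged b should appear at all.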

My next suggestion would be to examine all of those .*'s in your SELECT. Do you really need all the columns from all the tables?

3 Comments

Interesting note about the JOIN clause... but strangely enough, moving the condition into the JOIN clause actually made it slower. When selecting all of the columns, EXPLAIN changed the "a" table row to: | 1 | SIMPLE | t0 | ALL | NULL | NULL | NULL | NULL | 45813 | Using temporary; Using filesort | (not good: it did a full table scan). But again, interestingly enough, when I switched to selecting only a.id (alongside the new JOIN logic you suggested above), the query performed exactly the same as the 2-query approach from above.
Also, yes, thanks for the note about .*. Agreed, it's very rare that a "*" should be used in a SELECT clause. I simply did that here to make my post a bit more readable; in our code, we do explicitly select the columns we need.
One big reason why SELECT * is bad for performance: it always has to read the full table rows (possibly from disk) unless you have a covering index on all columns in the table. If you instead return a more limited column set that an index can satisfy, it's possible to produce the query result directly from the index.
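
The covering-index effect can be observed in miniature. The sketch below is not MySQL: it uses Python's built-in sqlite3 module, whose EXPLAIN QUERY PLAN reports a covering index in roughly the way MySQL's EXPLAIN reports "Using index". The names t0 and shop_id_idx are borrowed from the EXPLAIN output in the question, while the table layout and the payload column are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t0 (id INTEGER PRIMARY KEY, shop_id INTEGER, payload TEXT);
    CREATE INDEX shop_id_idx ON t0 (shop_id);
""")

def plan(sql):
    # The human-readable plan text is the last column of each
    # EXPLAIN QUERY PLAN row.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Only id is needed: the index on shop_id also stores the row id,
# so the table itself does not have to be read.
ids_plan = plan("SELECT id FROM t0 WHERE shop_id = 1")

# All columns are needed: the index finds the rows, but payload
# has to be read from the table.
star_plan = plan("SELECT * FROM t0 WHERE shop_id = 1")

print(ids_plan)
print(star_plan)
```

The exact plan wording varies between SQLite versions, but the first plan should mention a covering index and the second a plain index search, which mirrors the "Using index" difference between the two EXPLAIN outputs in the question.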
