1

I am hoping to get some advice on how to optimize the performance of this query I have with an outer join. First I will explain what I am trying to do and then I'll show the code and results.

I have an Accounts table that has a list of all customer accounts. And I have a datausage table which keeps track of how much data each customer is using. A backend process running on multiple servers inserts records into the datausage table each day to keep track of how much usage occurred that day for each customer on that server.

The backend process works like this - if there is no activity on that server for an account on that day, no records are written for that account. If there is activity, one record is written with a "LogDate" of that day. This is happening on multiple servers. So collectively the datausage table winds up with no rows (no activity at all for that customer each day), one row (activity was only on one server for that day), or multiple rows (activity was on multiple servers for that day).

We need to run a report that lists ALL customers, along with their usage for a specific date range. Some customers may have no usage at all (nothing whatsoever in the datausage table). Some customers may have no usage at all for the current period (but usage in other periods).

Regardless of whether there is any usage or not (ever, or for the selected period) we need EVERY customer in the Accounts table to be listed in the report, even if they show no usage. Therefore it seems this required an outer join.

Here is the query I am using:

SELECT
   Accounts.accountID as AccountID,
   IFNULL(Accounts.name,Accounts.accountID) as AccountName,
   AccountPlans.plantype as AccountType,
   Accounts.status as AccountStatus,
   date(Accounts.created_at) as Created,
   sum(IFNULL(datausage.Core,0) + (IFNULL(datausage.CoreDeluxe,0) * 3)) as 'CoreData'
FROM `Accounts` 
 LEFT JOIN `datausage` on `Accounts`.`accountID` = `datausage`.`accountID`
 LEFT JOIN `AccountPlans` on `AccountPlans`.`PlanID` = `Accounts`.`PlanID`
WHERE
(
   (`datausage`.`LogDate` >= '2014-06-01' and `datausage`.`LogDate` < '2014-07-01') 
   or `datausage`.`LogDate` is null
) 
GROUP BY Accounts.accountID 
ORDER BY `AccountName` asc 

This query takes about 2 seconds to run. However it only takes 0.3 seconds to run if the "or datausage.LogDate is NULL" is removed. However, it seems I must have that clause in there, because accounts with no usage are excluded from the result set if that does not appear.

Here is the table data:

| id | select_type | table        | type   | possible_keys                                           | key     | key_len | ref                  | rows  | Extra                                                  |
+----+-------------+--------------+--------+---------------------------------------------------------+---------+---------+----------------------+-------    +----------------------------------------------------+
|  1 | SIMPLE      | Accounts     | ALL    | PRIMARY,accounts_planid_foreign,accounts_cardid_foreign | NULL    | NULL    | NULL                 |    57 | Using     temporary; Using filesort                    |
|  1 | SIMPLE      | datausage   | ALL    | NULL                                                    | NULL    | NULL    | NULL                 | 96805 | Using where;     Using join buffer (Block Nested Loop) |
|  1 | SIMPLE      | AccountPlans | eq_ref | PRIMARY                                                 | PRIMARY | 4       | mydb.Accounts.planID |     1 | NULL                                                   |

The indexes on Accounts table are as follows:

| Table    | Non_unique | Key_name                | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Accounts |          0 | PRIMARY                 |            1 | accountID   | A         |          57 |     NULL | NULL   |      | BTREE      |         |               |
| Accounts |          1 | accounts_planid_foreign |            1 | planID      | A         |           5 |     NULL | NULL   |      | BTREE      |         |               |
| Accounts |          1 | accounts_cardid_foreign |            1 | cardID      | A         |           0 |     NULL | NULL   | YES  | BTREE      |         |               |

The index on the datausage table is as follows:

| Table      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| datausage |          0 | PRIMARY  |            1 | UsageID     | A         |       96805 |     NULL | NULL   |      | BTREE      |         |               |

I tried creating different indexes on datausage to see if it would help, but nothing did. I tried an index on AccountID, an index on AccountID, LogData, and index on LogData, AccountID, and an index on LogData. None of these made any difference.

I also tried using a UNION ALL with one of the queries with the logdata range and the other query just where logdata is null, but the result was about the same (actually a bit worse).

Can someone please help me understand what may be going on and the ways in which I can optimize the query execution time? Thank you!!

UPDATE: At Philipxy's request, here are the table definitions. Note that I removed some columns and constraints that are not related to this query to help keep things as tight and clean as possible.

CREATE TABLE `Accounts` (
   `accountID` varchar(25) NOT NULL,
   `name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
   `status` int(11) NOT NULL,
   `planID` int(10) unsigned NOT NULL DEFAULT '1',
   `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
   PRIMARY KEY (`accountID`),
   KEY `accounts_planid_foreign` (`planID`),
   KEY `acctname_id_ndx` (`name`,`accountID`),
   CONSTRAINT `accounts_planid_foreign` FOREIGN KEY (`planID`) REFERENCES `AccountPlans` (`planID`)
   ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 


CREATE TABLE `datausage` (
   `UsageID` int(11) NOT NULL AUTO_INCREMENT,
   `Core` int(11) DEFAULT NULL,
   `CoreDelux` int(11) DEFAULT NULL,
   `AccountID` varchar(25) DEFAULT NULL,
   `LogDate` date DEFAULT NULL
   PRIMARY KEY (`UsageID`),
   KEY `acctusage` (`AccountID`,`LogDate`)
   ) ENGINE=MyISAM AUTO_INCREMENT=104303 DEFAULT CHARSET=latin1 


CREATE TABLE `AccountPlans` (
   `planID` int(10) unsigned NOT NULL AUTO_INCREMENT,
   `name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
   `params` text COLLATE utf8_unicode_ci NOT NULL,
   `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
   `plantype` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
   PRIMARY KEY (`planID`),
   KEY `acctplans_id_type_ndx` (`planID`,`plantype`)
 ) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci 
12
  • What are the table defintions? In particular, what columns can be null and in the cases you timed have nulls? Commented Jun 13, 2014 at 7:21
  • It is helpful to post SQL table definitions sufficient for sqlfiddle.com for us to test with. Commented Jun 13, 2014 at 8:01
  • Please update your explain for current keys for your query & some answer query. Helpful if your definitions ran in sqlfiddle (typos & order) plus sample data values so answerers could generate explain. Re engines: For baseline try all innodb with FK datausage accountid to accounts. (No mysql innodb-isam fks.) FKs can be extremely important. Commented Jun 17, 2014 at 0:45
  • Hi philipxy - you want the EXPLAIN output for which of the queries? My original one, or one of the other ones posted here (and if so, which one)? It will be over a week before I can update the system to convert the table from myISAM to InnoDB. I am thinking that may be the primary issue. And making LogData defined as NOT NULL may help. I will post back here near the end of the month with those results when I am able to try it on the system. Commented Jun 17, 2014 at 19:22
  • Hi philipxy and all - I was finally able to resume working on this. First thing I tried was converting the datausage table from myIASM to InnoDB. Surprisingly that did not make any usage. Then I changed the LogDate column to NOT NULL, which did not help either. I created the SQL fiddle you asked for at: sqlfiddle.com/#!9/f3259/4/0 . Note that all versions of the queries presented here (including my own) have the same EXPLAIN plan. Please note that although this query is fast in the fiddle, it only has a dozen rows in datausage there. In our real system it has nearly 100,000 rows. Thx! Commented Jun 30, 2014 at 4:00

2 Answers 2

1

First, you can simplify the query by moving the where clause to the on clause:

SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
       ap.plantype as AccountType, a.status as AccountStatus,
       date(a.created_at) as Created,
       sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN 
     datausage du
     on a.accountID = du.`accountID` AND
        du.`LogDate` >= '2014-06-01' and du.`LogDate` < '2014-07-01'
LEFT JOIN 
     AccountPlans ap
     on ap.`PlanID` = a.`PlanID`
GROUP BY a.accountID 
ORDER BY AccountName asc ;

(I also introduced table aliases to make the query easier to read.)

This version should make better uses of indexes because it eliminates the or in the where clause. However, it still won't use an index for the outer sort. The following might be better:

SELECT a.accountID as AccountID, coalesce(a.name, a.accountID) as AccountName,
       ap.plantype as AccountType, a.status as AccountStatus,
       date(a.created_at) as Created,
       sum(coalesce(du.Core, 0) + (coalesce(du.CoreDeluxe, 0) * 3)) as CoreData
FROM Accounts a LEFT JOIN 
     datausage du
     on a.accountID = du.`accountID` AND
        du.LogDate >= '2014-06-01' and du.LogDate < '2014-07-01'LEFT JOIN 
     AccountPlans ap
     on ap.PlanID = a.PlanID
GROUP BY a.accountID 
ORDER BY a.name, a.accountID ;

For this, I would recommend the following indexes:

Accounts(name, AccountId)
Datausage(AccountId, LogDate)
AccountPlans(PlanId, PlanType)
Sign up to request clarification or add additional context in comments.

5 Comments

@user2838966 Note that by moving the WHERE constraint, Gordon has preserved the 'outerness' of your join, which was lost in your version of the query.
Thank you Strawberry! I just tried the last SQL statement you provided, along with the indexes, and the query still took 2 seconds to run. No different than what I had before. At first glance the EXPLAIN output is identical as well. Any other ideas?
Also - I just tried the first query you wrote and it performs no different than the first. So in summary, both your queries and my query all take 2 seconds. What else could we try? Thank you.
Gordon - meant to say thank you above, but wrote "Strawberry". Tho I'm thankful for his help too. Please let me know if you think there is something that can improve this.
Those two queries only differ by ORDER BY.
1

When you left join with datausage you should restrict the output as much as possible right there. (JOIN means AND means WHERE means ON. Put the conditions in essentially whatever order will be clear and/or optimize when necessary.) The result will be a null-extended row when there was no usage; you want to leave that row in.

When you join with AccountPlans you don't want to introduce null rows (which can't happen anyway) so that's just an inner join.

The version below has the AccountPlan join as an inner join and put first. (Indexed) Accounts FK PlanID to AccountPlan means the DBMS knows the inner join will only ever generate one row per Accounts PK. So the output has key AccountId. That row can be immediately inner joined to datausage. (An index on its AccountID should help, eg for a merge join.) For the other way around there is no PlanID key/index on the outer join result to join with AccountPlan.

SELECT
   a.accountID as AccountID,
   IFNULL(a.name,a.accountID) as AccountName,
   ap.plantype as AccountType,
   a.status as AccountStatus,
   date(a.created_at) as Created,
   sum(IFNULL(du.Core,0) + (IFNULL(du.CoreDeluxe,0) * 3)) as CoreData
FROM Accounts a
 JOIN AccountPlans ap ON ap.PlanID = a.PlanID
 LEFT JOIN datausage du ON a.accountID = du.accountID AND du.LogDate >= '2014-06-01' AND du.LogDate < '2014-07-01'
GROUP BY a.accountID

6 Comments

Thank you philipxy. Unfortunately your query runs slower (2.4 seconds) then the 2.0 seconds that my original query and the two suggested by Gordon take (also 2.0 seconds). The EXPLAIN output is the same in all versions of these queries tried so far. The indexes I have are as Gordon suggested. Where to from here? Thanks!
Also I meant to add that the inclusion of the AccountPlans table and plantype have no bearing on the query speed on any of the queries presented here (including my own). I was going to leave it out in the example just to keep things simple, but I figured I better include it because then once a winning query was presented I may not know how to work that part of the query back in.
Does anyone have suggestions I can try? So far the performance of these queries is no different that my original. Thanks.
As I asked a while ago, please sent your table definitions, because they affect optimization, in particular whether a column can be null.
Thanks for your message philipxy! Sorry, I did not see your previous request for the table definitions. I just added them to the original post above. Note that I cut out some column names that aren't used in the query to help keep it brief. Regarding NULL columns, in case it matters, yes LogDate could definitely be redefined to be NOT NULL, since all rows in datausage will always have a LogDate. Except of course it will always be NULL when returned from the outer join with no match. Also I just noticed that the datausage table is myISAM whereas the other tables are InnoDB in case that matters?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.