0

I have been trying to optimise some SQL queries based on the assumption that Joining tables is more efficient than nesting queries. I am joining the same table multiple times to perform a different analysis on the data.

I have 2 tables:

transactions:

id    |   date_add    |   merchant_ id    | transaction_type      |     amount
1         1488733332          108                  add                     20.00
2         1488733550          108                 remove                   5.00

and a calendar table which just lists dates so that I can create empty records where there are no transactions on particular days:

calendar:

id     |    datefield
1           2017-03-01
2           2017-03-02
3           2017-03-03
4           2017-03-04

I have many thousands of rows in the transactions table, and I'm trying to get an annual summary of total and different types of transactions per month (i.e 12 rows in total), where

  • transactions = sum of all "amount"s,
  • additions = sum of all "amounts" where transaction_type = "add"
  • redemptions = sum of all "amounts" where transaction_type = "remove"

result:

month     |    transactions     |    additions  |   redemptions
Jan                15                  12               3
Feb                20                  15               5
...

My initial query looks like this:

SELECT  COALESCE(tr.transactions, 0) AS transactions, 
        COALESCE(ad.additions, 0) AS additions, 
        COALESCE(re.redemptions, 0) AS redemptions, 
        calendar.date 
FROM (SELECT DATE_FORMAT(datefield, '%b %Y') AS date FROM calendar WHERE datefield LIKE '2017-%' GROUP BY YEAR(datefield), MONTH(datefield)) AS calendar 
LEFT JOIN (SELECT COUNT(transaction_type) as transactions, from_unixtime(date_add, '%b %Y') as date_t FROM transactions WHERE merchant_id = 108  GROUP BY from_unixtime(date_add, '%b %Y')) AS tr
ON calendar.date = tr.date_t
LEFT JOIN (SELECT COUNT(transaction_type = 'add') as additions, from_unixtime(date_add, '%b %Y') as date_a FROM transactions WHERE merchant_id = 108  AND transaction_type = 'add' GROUP BY from_unixtime(date_add, '%b %Y')) AS ad
ON calendar.date = ad.date_a
LEFT JOIN (SELECT COUNT(transaction_type = 'remove') as redemptions, from_unixtime(date_add, '%b %Y') as date_r FROM transactions WHERE merchant_id = 108  AND transaction_type = 'remove' GROUP BY from_unixtime(date_add, '%b %Y')) AS re
ON calendar.date = re.date_r

I tried optimising and cleaning it up a little, removing the nested statements and came up with this:

SELECT 
    DATE_FORMAT(cal.datefield, '%b %d') as date,
    IFNULL(count(ct.amount),0) as transactions, 
    IFNULL(count(a.amount),0) as additions, 
    IFNULL(count(r.amount),0) as redeptions
FROM calendar as cal 
LEFT JOIN transactions as ct ON cal.datefield = date(from_unixtime(ct.date_add))  && ct.merchant_id = 108
LEFT JOIN transactions as r ON r.id = ct.id && r.transaction_type = 'remove'
LEFT JOIN transactions as a ON a.id = ct.id && a.transaction_type = 'add' 
WHERE cal.datefield like '2017-%'
GROUP BY month(cal.datefield)

I was surprised to see that the revised statement was about 20x slower than the original with my dataset. Have I missed some sort of logic? Is there a better way to achieve the same result with a more streamlined query, given I am joining the same table multiple times?

EDIT: So to further explain the results I'm looking for - I'd like a single row for each month of the year (12 rows) each with a column for the total transactions, total additions, and total redemptions in each month.

The first query I was getting a result in about 0.5 sec but with the second I was getting results in 9.5sec.

7
  • Can you add an explain and post the results for the optimized and the non-optimized queries? Commented Mar 5, 2017 at 17:21
  • I really wouldn't bother with a calendar table for this Commented Mar 5, 2017 at 17:28
  • I see you used && in the second query in the ON statements off the LEFT JOIN? They should be AND Commented Mar 5, 2017 at 17:32
  • Could you explain the columns in the result set? I believe this task can be solved in a less complex way. Commented Mar 5, 2017 at 17:33
  • @PaulSpiegel I added a couple of edits there to explain the expected results Commented Mar 5, 2017 at 17:38

2 Answers 2

3

Looking to your query You could use a single left join with case when

SELECT  COALESCE(t.transactions, 0) AS transactions, 
        COALESCE(t.additions, 0) AS additions, 
        COALESCE(t.redemptions, 0) AS redemptions, 
        calendar.date 
FROM (SELECT DATE_FORMAT(datefield, '%b %Y') AS date 
          FROM calendar 
          WHERE datefield LIKE '2017-%' 
          GROUP BY YEAR(datefield), MONTH(datefield)) AS calendar 
LEFT JOIN 
 ( select 
      COUNT(transaction_type) as transactions
      , sum( case when transaction_type = 'add' then 1 else 0 end ) as additions
      , sum( case when transaction_type = 'remove' then 1 else 0 end ) as redemptions
      ,  from_unixtime(date_add, '%b %Y') as date_t 
      FROM transactions 
      WHERE merchant_id = 108  
      GROUP BY from_unixtime(date_add, '%b %Y' ) t ON calendar.date = t.date_t
Sign up to request clarification or add additional context in comments.

6 Comments

Thanks @scaisEdge the CASE WHEN is a revelation - muchos gracias! That's down from 0.5 sec to 0.2
The first SUM can be simplified to SUM(transaction_type = 'add').
If there is an index on datefield, datefield LIKE '2017-%' can be sped up by datefield >= '2017-01-01' AND datefield < '2017-01-01' + INTERVAL 1 YEAR.
I'm pretty sure the COALESCE is unnecessary -- SUM( nothing matches ) is 0 anyway.
@RickJames I have refactored the left join .. letting the main select as done by th OP ..
|
0

First I would create a derived table with timestamp ranges for every month from your calendar table. This way a join with the transactions table will be efficient if date_add is indexed.

select month(c.datefield) as month, 
       unix_timestamp(timestamp(min(c.datefield), '00:00:00')) as ts_from,
       unix_timestamp(timestamp(max(c.datefield), '23:59:59')) as ts_to
from calendar c
where c.datefield between '2017-01-01' and '2017-12-31'
group by month(c.datefield)

Join it with the transaactions table and use conditional aggregations to get your data:

select c.month,
       sum(t.amount) as transactions,
       sum(case when t.transaction_type = 'add'    then t.amount else 0 end) as additions,
       sum(case when t.transaction_type = 'remove' then t.amount else 0 end) as redemptions
from (
    select month(c.datefield) as m, date_format(c.datefield, '%b') as `month`
           unix_timestamp(timestamp(min(c.datefield), '00:00:00')) as ts_from,
           unix_timestamp(timestamp(max(c.datefield), '23:59:59')) as ts_to
    from calendar c
    where c.datefield between '2017-01-01' and '2017-12-31'
    group by month(c.datefield), date_format(c.datefield, '%b')
) c
left join transactions t on t.date_add between c.ts_from and c.ts_to
where t.merchant_id = 108
group by c.m, c.month
order by c.m

2 Comments

It is likely to be some faster if you make the ORDER BY match the GROUP BY. This may avoid an extra sort.
@RickJames You think the optimizer isn't able to "see" that ORDER BY is a subset of GROUP BY? However it doesn't really matter here since the resultset contains only 12 rows. I have another problem here. This query takes 2 seconds on 1M-row-table (~330K rows for 2014). But changing to inner join it runs in 100 msec. So i would need to put the code into another subquery. MySQL optimizer is sometimes really stupid.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.