1

I have the following SQL query

SELECT  *
    FROM  `sensor_data` AS `sd1`
    WHERE  (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
                              AND '2017-05-14 00:00:00')
      AND  (`id` = 
            (
                SELECT  `id`
                    FROM  `sensor_data` AS `sd2`
                    WHERE  sd1.mid = sd2.mid
                      AND  sd1.sid = sd2.sid
                    ORDER BY  `value` DESC, `id` DESC
                    LIMIT  1)
           ) 

Background:

I've checked the validity of the query by changing LIMIT 1 to LIMIT 0, and the query works without any problem. However with LIMIT 1 the query doesn't complete, it just states loading until I shutdown and restart.

Breaking the Query down:

I have broken down the query with the date boundary as follows:

SELECT  *
    FROM  `sensor_data` AS `sd1`
    WHERE  (sd1.timestamp BETWEEN '2017-05-13 00:00:00'
                              AND '2017-05-14 00:00:00') 

This takes about 0.24 seconds to return the query with 8200 rows each having 5 columns.

Question:

I suspect the second half of my Query, is not correct or well optimized. The tables are as follows:

Current Table:

+------+-------+-------+-----+-----------------------+
|  id  |  mid  |  sid  |  v  |       timestamp       |
+------+-------+-------+-----+-----------------------+
|  51  |  10   |   1   |  40 |  2015-05-13 11:56:01  |
|  52  |  10   |   2   |  39 |  2015-05-13 11:56:25  |
|  53  |  10   |   2   |  40 |  2015-05-13 11:56:42  |
|  54  |  10   |   2   |  40 |  2015-05-13 11:56:45  |
|  55  |  10   |   2   |  40 |  2015-05-13 11:57:01  |
|  56  |  11   |   1   |  50 |  2015-05-13 11:57:52  |
|  57  |  11   |   2   |  18 |  2015-05-13 11:58:41  |
|  58  |  11   |   2   |  19 |  2015-05-13 11:58:59  |
|  59  |  11   |   3   |  58 |  2015-05-13 11:59:01  |
|  60  |  11   |   3   |  65 |  2015-05-13 11:59:29  |
+------+-------+-------+-----+-----------------------+

Q: How would I get the MAX(v)for each sid for each mid? NB#1: In the example above ROW 53, 54, 55 have all the same value (40), but I would like to retrieve the row with the most recent timestamp, which is ROW 55.

Expected Output:

+------+-------+-------+-----+-----------------------+
|  id  |  mid  |  sid  |  v  |       timestamp       |
+------+-------+-------+-----+-----------------------+
|  51  |  10   |   1   |  40 |  2015-05-13 11:56:01  |
|  55  |  10   |   2   |  40 |  2015-05-13 11:57:01  |
|  56  |  11   |   1   |  50 |  2015-05-13 11:57:52  |
|  58  |  11   |   2   |  19 |  2015-05-13 11:58:59  |
|  60  |  11   |   3   |  65 |  2015-05-13 11:59:29  |
+------+-------+-------+-----+-----------------------+

Structure of the table:

enter image description here

NB#2: Since this table has over 110 million entries, it is critical to have have date boundaries, which limits to ~8000 entries over a 24 hour period.

3 Answers 3

1

The query can be written as follows:

SELECT t1.id, t1.mid, t1.sid, t1.v, t1.ts
FROM yourtable t1
INNER JOIN (
        SELECT mid, sid, MAX(v) as v 
            FROM yourtable
        WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
        GROUP BY mid, sid
    ) t2
    ON  t1.mid = t2.mid
    AND t1.sid = t2.sid
    AND t1.v   = t2.v
INNER JOIN (
        SELECT mid, sid, v, MAX(ts) as ts
            FROM yourtable
        WHERE ts BETWEEN '2015-05-13 00:00:00' AND '2015-05-14 00:00:00'
        GROUP BY mid, sid, v
    ) t3
    ON  t1.mid = t3.mid
    AND t1.sid = t3.sid
    AND t1.v   = t3.v
    AND t1.ts  = t3.ts; 

Edit and Explanation:

The first sub-query (first INNER JOIN) fetches MAX(v) per (mid, sid) combination. The second sub-query is to identify MAX(ts) for every (mid, sid, v). At this point, the two queries do not influence each others' results. It is also important to note that ts date range selection is done in the two sub-queries independently such that the final query has fewer rows to examine and no additional WHERE filters to apply.

Effectively, this translates into getting MAX(v) per (mid, sid) combination initially (first sub-query); and if there is more than one record with the same value MAX(v) for a given (mid, sid) combo, then the excess records get eliminated by the selection of MAX(ts) for every (mid, sid, v) combination obtained by the second sub-query. We then simply associate the output of the two queries by the two INNER JOIN conditions to get to the id of the desired records.

Demo

Sign up to request clarification or add additional context in comments.

6 Comments

I saw a comment before (which is now deleted) about wanting to add date range checks, but I couldn't reply then. If you were to add a WHERE condition for date range check, I'd recommend doing so in the subquery. As an aside, any ORDER BY should be a part of the main query (adding it in subquery would usually be futile).
I'll also need the row with the most recent timestamp which meet these conditions, for further clarification please have a read of NB#1. I have changed the sample table by adding rows 54 and 55.
@3kstc Please tell me if I'm wrong, but isn't this technically equivalent to then possibly having two exactly same max(v) for a given (mid, sid) pair, but recorded into the table with different timestamps?
that is correct... the id would be unique as it is a primary key, we could use something like the max(id) for a corresponding max(v) for a given (mid ,sid) pair.
@3kstc Please check the edit. If the table is indexed properly, hopefully the revised query should produce the expected results in a reasonable time.
|
0

select * from sensor_data s1 where s1.v in (select max(v) from sensor_data s2 group by s2.mid)

union

select * from sensor_data s1 where s1.v in (select max(v) from sensor_data s2 group by s2.sid);

Comments

0

IN ( SELECT ... ) does not optimize well. It is even worse because of being correlated.

What you are looking for is a groupwise-max .

Please provide SHOW CREATE TABLE; we need to know at least what the PRIMARY KEY is.

Suggested code

You will need:

  • With the WHERE: INDEX(timestamp, mid, sid, v, id)
  • Without the WHERE: INDEX(mid, sid, v, timestamp, id)

Code:

SELECT id, mid, sid, v, timestamp
    FROM ( SELECT @prev_mid := 99999,   -- some value not in table
                  @prev_sid := 99999,
                  @n := 0 ) AS init
    JOIN (
             SELECT  @n := if(mid != @prev_mid OR
                              sid != @prev_sid,
                              1, @n + 1) AS n,
                     @prev_mid := mid,
                     @prev_sid := sid,
                     id, mid, sid, v, timestamp
                 FROM sensor_data
                 WHERE timestamp >= '2017-05-13'
                       timestamp  < '2017-05-13' + INTERVAL 1 DAY
                 ORDER BY mid DESC, sid DESC, v DESC, timestamp DESC
         ) AS x
    WHERE n = 1
    ORDER BY mid, sid;   -- optional

Notes:

  • The index is 'composite' and 'covering'.
  • This should make one pass over the index, thereby providing 'good' performance.
  • The final ORDER BY is optional; the results may be in reverse order.
  • All the DESC in the inner ORDER BY must be in place to work correctly (unless you are using MySQL 8.0).
  • Note how the WHERE avoids including both midnights? And avoids manually computing leap-days, year-ends, etc?
  • With the WHERE (and associated INDEX), there will be filtering, but a 'sort'.
  • Without the WHERE (and the other INDEX), sort will not be needed.

You can test the performance of any competing formulations via this trick, even if you do not have enough rows (yet) to get reliable timings:

FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';

This can also be used to compare different versions of MySQL and MariaDB -- I have seen 3 significantly different performance characteristics in a related groupwise-max test.

4 Comments

I've placed my screenshot, the PRIMARY KEY is the id
is there anything else I can give you from my end to optimise the search query?
I have tried implementing the groupwise-max but I am seriously struggling :( Any help will be much appreciated :)
Oh, I see the struggle. Looks like a case that that link does not easily cover.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.