I'm having trouble writing succinct code that generates the desired result efficiently (on a database with several million records).

  • items are grouped by time
  • within each time, one item is selected by provider, where B takes precedence over A (and C over B)
  • the returned value must be the value from the selected provider's row

Table vs wanted result:

// given this table
id | provider | time       | value
---+----------+------------+-----------
 1 |    A     | 2013-07-01 |  0.1
 2 |    A     | 2013-07-02 |  0.2
 3 |    B     | 2013-07-02 |  0.3
 4 |    A     | 2013-07-03 |  0.4

// produce this result
id | provider | time       | value
---+----------+------------+-----------
 1 |    A     | 2013-07-01 |  0.1
 3 |    B     | 2013-07-02 |  0.3
 4 |    A     | 2013-07-03 |  0.4

The queries to generate table and populate data:

CREATE TABLE `data_teste` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `provider` varchar(12) NOT NULL,
  `time` date NOT NULL,
  `value` double NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index` (`provider`,`time`),
  KEY `provider` (`provider`),
  KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO data_teste(`provider`, `time`, `value`) VALUES('A', '2013-07-01', 0.1),('A', '2013-07-02', 0.2),('B', '2013-07-02', 0.3),('A', '2013-07-03', 0.4);

This is the classic group-by/sort problem, applied per time window.

Thank you very much.

  • I think you should explain why that is the desired result. Commented Jul 18, 2013 at 17:15
  • @EvanMulawski Different providers bring in data as a time series, but when there is overlap in the time field I want the data from provider B to take precedence over A (and so on). Commented Jul 18, 2013 at 17:22

2 Answers

select d.*
from data_teste d
inner join
(
   -- highest-precedence provider per time (C > B > A sorts alphabetically)
   select `time`, max(provider) as mp
   from data_teste
   group by `time`
) x on x.mp = d.provider
    and x.`time` = d.`time`
order by d.`time` asc,
         d.provider desc

3 Comments

This is definitely more elegant than what I had. It still performs slowly (up to 4 seconds), but it's so much more concise that I can probably speed it up by requesting only limited buckets of time. Thanks!
You can use EXPLAIN SELECT ... to see where the performance bottleneck is.
All the indexes look good. The main bottleneck is actually having to group a dataset of this size. It can be, and is, controlled by limiting its time range. Thanks!
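
The time-range limiting discussed in these comments could look roughly like the sketch below. It is the join query from this answer with a WHERE clause added inside the subquery, prefixed with EXPLAIN to inspect the plan. The date bounds are placeholder values chosen for illustration; they do not come from the original post.

```sql
-- Sketch only: the date bounds below are hypothetical placeholders.
EXPLAIN
SELECT d.*
FROM data_teste d
INNER JOIN (
    -- Group only the rows that fall inside the requested time bucket.
    SELECT `time`, MAX(provider) AS mp
    FROM data_teste
    WHERE `time` >= '2013-07-01' AND `time` < '2013-08-01'
    GROUP BY `time`
) x ON x.mp = d.provider
   AND x.`time` = d.`time`
ORDER BY d.`time` ASC;
```

Filtering inside the subquery means only the rows in that bucket are grouped. If EXPLAIN shows the grouping step scanning many rows, a composite index on (`time`, `provider`) is one thing to try; that is an assumption to verify against real data, since the table as posted only has (`provider`, `time`) and single-column indexes.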

How well does this perform?

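SELECT
  dt1.*
FROM
  `data_teste` dt1
   -- look for a row at the same time with a higher-precedence provider
   LEFT JOIN `data_teste` dt2 ON ( dt2.`time` = dt1.`time`
                                    AND dt2.provider > dt1.provider )
WHERE
  dt2.id IS NULL  -- keep dt1 only when no such row exists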
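
For comparison, the same "highest provider per time" selection can be sketched with a window function. This assumes MySQL 8.0 or later, which the original post does not state; window functions are not available in MySQL 5.x. The alias names `ranked` and `rn` are made up for this example.

```sql
-- Assumes MySQL 8.0+ (window functions); `ranked` and `rn` are illustrative names.
SELECT id, provider, `time`, `value`
FROM (
    SELECT d.*,
           -- Number the rows within each time, highest provider first.
           ROW_NUMBER() OVER (PARTITION BY `time` ORDER BY provider DESC) AS rn
    FROM data_teste d
) ranked
WHERE rn = 1
ORDER BY `time`;
```

Like the other queries here, this relies on provider names sorting in precedence order (A < B < C). Whether it is faster than the self-join on a table of several million rows is something to check with EXPLAIN rather than assume.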
