2

I have a table where one or more entries with the same 'id' value can be inserted into our log / fact table (contains over 100+ million records)
At a set frequency a new record is inserted into this table with a new value for the columns 'created' and 'view_percent' (percentage of a video viewed).
With two different queries, I would like to return:

Desired Result 1:

+------------------+--------------+-------------+------------------+------------+
| archive_asset_id | asset_title  | count_asset | avg_view         | time_day   |
+------------------+--------------+-------------+------------------+------------+
|            83386 | Oliver James |           4 | 36.75            | 2010-08-09 |
+------------------+--------------+-------------+------------------+------------+

With this query I am not able to filter out the necessary records when performing the aggregate function... instead the average of all rows taken with a resulting value of 31.307

SELECT archive_asset_id, asset_title, COUNT(DISTINCT id * 1000000 + archive_asset_id) AS count_asset, AVG(view_percent) AS avg_view, FROM_UNIXTIME(created, '%Y-%m-%d') AS time_day
FROM log_embed_video 
WHERE archive_asset_id = 83386 
AND created >= 1281312000
AND created < 1281484800
GROUP BY time_day
ORDER BY time_day;

Desired Result 2:

+------------+------------------+---------------+------------------+-------------------------------+
| time_day   | archive_asset_id | asset_title  | MAX(view_percent) | occurrences MAX(view_percent) |
+------------+------------------+---------------+------------------+-------------------------------+
| 2010-08-09 |            83386 | Oliver James |                13 | 1                             |
| 2010-08-09 |            83386 | Oliver James |                17 | 2                             |
| 2010-08-09 |            83386 | Oliver James |               100 | 1                             |
+------------+------------------+---------------+-------------------+------------------------------+

This is the query that I have used for result 2, but not quite what I want... the group by log_embed_video.id yields 4 results... which is to be expected for the given query, but not the desired output.

SELECT id, FROM_UNIXTIME(created, '%Y-%m-%d') AS time_day, archive_asset_id, asset_title, COUNT(DISTINCT id * 1000000 + archive_asset_id) AS 'count_asset', MAX(view_percent) as 'max_view_percent'
FROM log_embed_video 
WHERE archive_asset_id = 83386 
AND created >= 1281312000
AND created < 1281484800
GROUP BY time_day, id



Conditioned Data:
The rows mark with KEEP is the data I want to work with when returning result 1 and result 2.

SELECT id, archive_asset_id, asset_title, view_percent, FROM_UNIXTIME(created, '%Y-%m-%d') AS time_day
FROM log_embed_video 
WHERE archive_asset_id = 83386 
AND created >= 1281312000
AND created < 1281484800
ORDER BY id, view_percent;
+----------+------------------+--------------+--------------+------------+
| id       | archive_asset_id | asset_title  | view_percent | time_day   |
+----------+------------------+--------------+--------------+------------+
| 43326898 |            83386 | Oliver James |            0 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43326898 |            83386 | Oliver James |           13 | 2010-08-09 | + KEEP
| 43432090 |            83386 | Oliver James |            0 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43432090 |            83386 | Oliver James |           17 | 2010-08-09 | + KEEP
| 43432092 |            83386 | Oliver James |            0 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43432092 |            83386 | Oliver James |           17 | 2010-08-09 | + KEEP
| 43470093 |            83386 | Oliver James |            0 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43470093 |            83386 | Oliver James |           17 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43470093 |            83386 | Oliver James |           35 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43470093 |            83386 | Oliver James |           52 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43470093 |            83386 | Oliver James |           69 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43470093 |            83386 | Oliver James |           87 | 2010-08-09 | - DISCARD RECORD / DUPLICATE 
| 43470093 |            83386 | Oliver James |          100 | 2010-08-09 | + KEEP
+----------+------------------+--------------+--------------+------------+



Table and Raw Data:

CREATE TABLE `log_embed_video` (
  `id` int(11) NOT NULL,
  `archive_asset_id` int(11) NOT NULL,
  `asset_title` varchar(255) NOT NULL,
    `view_percent` float NOT NULL,
  `created` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


INSERT INTO `log_embed_video` VALUES 
(43326898, 83386, 'Oliver James', 0, 1281327306),
(43326898, 83386, 'Oliver James', 13, 1281327327),
(43432090, 83386, 'Oliver James', 0, 1281371423),
(43432090, 83386, 'Oliver James', 17, 1281371445),
(43432092, 83386, 'Oliver James', 0, 1281371424),
(43432092, 83386, 'Oliver James', 17, 1281371446),
(43470093, 83386, 'Oliver James', 0, 1281380789),
(43470093, 83386, 'Oliver James', 17, 1281380810),
(43470093, 83386, 'Oliver James', 35, 1281380830),
(43470093, 83386, 'Oliver James', 52, 1281380850),
(43470093, 83386, 'Oliver James', 69, 1281380871),
(43470093, 83386, 'Oliver James', 87, 1281380891),
(43470093, 83386, 'Oliver James', 100, 1281380906);
6
  • I can't find a question in there - I really don't know what you are asking. Commented Sep 20, 2010 at 15:37
  • +1 for sample data and create table... however it is still not really clear (to me) what you want to achieve (try to explain 'desired results' also in words; just staring at the data didn't help me understand what you are trying to achieve) Commented Sep 20, 2010 at 15:44
  • How should the code handle crossing over to the next day (i.e.: 23:59 => 50 % / 00:01 => 55%)? Commented Sep 20, 2010 at 15:53
  • I am unsure to filter out the duplicate data. Referring to the conditioned data.. I have 4 rows that I want to keep, filtering out all the other rows and then be able to use an aggregate function and group by on the filtered data. Commented Sep 20, 2010 at 15:59
  • @ Wrikken - I need to filter out the duplicate rows... for example, there are 2 records with an id of 43326898. The row containing that id and the view_percent column with the max / highest value is the what I want to filter on Commented Sep 20, 2010 at 16:05

2 Answers 2

1

All max-percentage per-id rows for a unique id:

SELECT a.* 
FROM log_embed_video a 
LEFT JOIN log_embed_video b
ON b.id = a.id
AND b.view_percent > a.view_percent
WHERE b.id IS NULL
-- possibly limit on date for  more performance.

Performance wise this is better:

SELECT * FROM (
    SELECT id, archive_asset_id, asset_title, view_percent, created,
        @rn := IF(id != @old_id,1,@rn + 1) as rownumber,
        @old_id := id 
    FROM log_embed_video 
    JOIN (SELECT @rn:=0,@old_id:=0) void
    ORDER BY id, view_percent DESC
) a WHERE rownumber=1;
Sign up to request clarification or add additional context in comments.

8 Comments

+1 because it works :) I do wonder which solution will Scott find easier to understand...
Readability suffers if isn't familiar with the LEFT JOIN trick indeed. It also suffers a bit performance-wise. I might enter an even more unreadable answer that is a lot quicker...
There, now all readability is out the window in favor of performance :P
@ Wrikken - thank for submitting this query, but this does not yield the desired result. Both of the modified queries submitted by Unreason works for what I am trying to accomplish. Thank you!!!
@Scott: those were the generic 'I want the row with the MIN/MAX value of a set, with the related data from the same row'. Indeed not the full queries you'd want, but both result in the proper 'Conditioned Data' you mention and are handy in anyones (My)SQL's toolset, especially the performance (on normal InnoDB / MyISAM tables at least) of the second one. But no worries, Unreasons' answer indeed does work properly.
|
0

Check if this will make it more clear for you

SELECT archive_asset_id, AVG(actual_percent) 
FROM (SELECT id, archive_asset_id, asset_title, 
             MAX(view_percent) as actual_percent 
      FROM log_embed_video GROUP by id) T 
GROUP BY archive_asset_id;

It returns:

+------------------+---------------------+
| archive_asset_id | AVG(actual_percent) |
+------------------+---------------------+
|            83386 |               36.75 | 
+------------------+---------------------+

A few notes

  • this will not perform well on 100M records
  • also you might want to normalize your data to improve the performance (which it will do in this case; basically moving the actual final rows into their own table makes much more sense to me)
  • the expression COUNT(DISTINCT id * 1000000 + archive_asset_id) caught my eye as something bizarre; are you sure you don't mean simply COUNT(*) or COUNT(id)?

EDIT:

For the second one

SELECT archive_asset_id, actual_percent, count(*) 
FROM (SELECT id, archive_asset_id, asset_title,               
             MAX(view_percent) as actual_percent        
      FROM log_embed_video GROUP by id) T  
GROUP BY archive_asset_id, actual_percent;

+------------------+----------------+----------+
| archive_asset_id | actual_percent | count(*) |
+------------------+----------------+----------+
|            83386 |             13 |        1 | 
|            83386 |             17 |        2 | 
|            83386 |            100 |        1 | 
+------------------+----------------+----------+

4 Comments

COUNT(DISTINCT id * 1000000 + archive_asset_id) I assume is a kludge to get to COUNT(DISTINCT id,archive_asset_id)
We are using a Infobright Brighthouse database that is designed for querying large sets of data. Currently that database does not support COUNT(DISTINCT col1, col2) so that is their recommended work around
@ Unreason - both of these modified queries work quite well. When filtering the data down by date range and by a single asset id the results are returned in less than 200 ms with over 100+ million records in this table. Thank you for taking the time to answer my question!
@Scot; ah I see the logic for the DISTINCT.. well that's not necessary either now.. (P.S see faq re voting)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.