I'm implementing a view to store leaderboard data for the top 10 users, computed with an expensive COUNT(*). I plan for the view to look something like this:

id SERIAL PRIMARY KEY
user_id TEXT
type TEXT
rank INTEGER
count INTEGER

-- adding an index to user_id
-- adding a two-column unique index to user_id and type

I'm having trouble seeing how this view should be created so that it properly accounts for rank and type. Essentially, I have a big table (~30 million rows) like this:

+----+---------+---------+----------------------------+
| id | user_id |  type   |         created_at         |
+----+---------+---------+----------------------------+
|  1 |       1 | Diamond | 2021-05-11 17:35:18.399517 |
|  2 |       1 | Diamond | 2021-05-12 17:35:17.399517 |
|  3 |       1 | Diamond | 2021-05-12 17:35:18.399517 |
|  4 |       2 | Diamond | 2021-05-13 17:35:18.399517 |
|  5 |       1 | Clay    | 2021-05-14 17:35:18.399517 |
|  6 |       1 | Clay    | 2021-05-15 17:35:18.399517 |
+----+---------+---------+----------------------------+

With the table above, I'm trying to achieve something like this:

+----+---------+---------+------+-------+
| id | user_id |  type   | rank | count |
+----+---------+---------+------+-------+
|  1 |       1 | Diamond |    1 |     3 |
|  2 |       2 | Diamond |    2 |     1 |
|  3 |       1 | Clay    |    1 |     2 |
|  4 |       1 | Weekly  |    1 |     5 | -- 3 diamonds + 2 clay obtained between Mon-Sun
|  5 |       2 | Weekly  |    2 |     1 |
+----+---------+---------+------+-------+

By Weekly I mean the period from the last Sunday to the upcoming Sunday.

Is this doable using only SQL, or is some kind of script needed? If doable, how would this be done? It's worth mentioning that there are thousands of different types, so not having to manually specify type would be preferred.

If there's anything unclear, please let me know and I'll do my best to clarify. Thanks!

  • What's the difference between id and user_id? Commented Dec 22, 2021 at 12:29
  • id is essentially just the row number/id that can be used as the primary key. Commented Dec 22, 2021 at 12:30
  • The first three rows of the result can be produced with a GROUP BY user_id, type combined with a RANK() function. Now, to get the last two rows, you'll need to clarify when the week starts and ends. Commented Dec 22, 2021 at 12:34
  • The week is essentially from a Sunday (i.e. 19th Dec) to Sunday (i.e. 26th Dec). Commented Dec 22, 2021 at 12:35
  • @TheImpaler How exactly does the RANK() function work in this query? Commented Dec 22, 2021 at 13:17

2 Answers

The "weekly" rows are produced in a different way compared to the "user" rows (I called them two different "categories"). To get the result you want you can combine two queries using UNION ALL.

For example:

select 'u' as category, user_id, type,
  rank() over(partition by type order by count(*) desc) as rk,
  count(*) as cnt
from scores
group by user_id, type
union all
select 'w', user_id, 'Weekly',
  rank() over(order by count(*) desc),
  count(*) as cnt
from scores
group by user_id
order by category, type desc, rk

Result:

 category  user_id  type     rk  cnt 
 --------- -------- -------- --- --- 
 u         1        Diamond  1   3   
 u         2        Diamond  2   1   
 u         1        Clay     1   2   
 w         1        Weekly   1   5   
 w         2        Weekly   2   1   

See running example at DB Fiddle.

Note: For the sake of simplicity I left the filtering by timestamp out of the query. If you really needed to include only the rows of the last 7 days (or other period of time), it would be a matter of adding a WHERE clause in both subqueries.
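
As a rough sketch of that WHERE clause for the "Weekly" half, assuming PostgreSQL, a created_at column of type timestamp, and a week running from Sunday 00:00 to the following Sunday 00:00 (the "bounds" CTE and "week_start" names are only illustrative). Since date_trunc('week', ...) starts weeks on Monday, the value is shifted forward one day before truncating and back one day afterwards:

```sql
-- Assumption: the "scores" table from the question, PostgreSQL syntax.
with bounds as (
  -- Most recent Sunday at 00:00 (today, if today is a Sunday).
  select date_trunc('week', localtimestamp + interval '1 day')
         - interval '1 day' as week_start
)
select user_id, 'Weekly' as type,
  rank() over(order by count(*) desc) as rk,
  count(*) as cnt
from scores, bounds
where created_at >= bounds.week_start
  and created_at <  bounds.week_start + interval '7 days'  -- half-open range
group by user_id
order by rk;
```

The half-open range (>= start, < start + 7 days) avoids counting a row that falls exactly on the boundary in two different weeks.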

I think this is what you were talking about, right?

WITH scores_plus_weekly AS ((
        SELECT id, user_id, 'Weekly' AS type, created_at
        FROM scores
        WHERE created_at BETWEEN '2021-05-10' AND '2021-05-17'
    )
    UNION ALL (
        SELECT * FROM scores
    ))
SELECT
    row_number() OVER (ORDER BY CASE "type" WHEN 'Diamond' THEN 0 WHEN 'Clay' THEN 1 ELSE 2 END, count(*) DESC) as "id",
    user_id,
    "type",
    row_number() OVER (PARTITION BY "type" ORDER BY count(*) DESC) as "rank",
    count(*)
FROM scores_plus_weekly
GROUP BY user_id, "type"
ORDER BY "id";

I'm sure this is not the only way, but I thought the result wasn't too complex. This query first combines the original table with a copy of all scores from this week, relabeled with the type 'Weekly'. For the sake of consistency I picked a date range that matches your entire example set. It then groups by user_id and type to get the counts for each combination. The two row_number calls give you the overall position (the id column) and the rank per type. A big part of this query consists of sorting by type, so if you're joining another table that contains the order or priority of the types, the CASE can probably be simplified.
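
To illustrate the lookup-table idea, here is a minimal sketch. The table name "type_order" and its columns "label" and "sort_order" are hypothetical, and for brevity it ranks the plain scores table rather than the combined CTE:

```sql
-- Hypothetical lookup table: one row per type with its display priority.
create table type_order (
  label      text primary key,
  sort_order integer not null
);

insert into type_order (label, sort_order)
values ('Diamond', 1), ('Clay', 2);

-- The join replaces the hard-coded CASE expression.
select
  row_number() over (order by o.sort_order, count(*) desc) as id,
  s.user_id,
  s.type,
  row_number() over (partition by s.type order by count(*) desc) as "rank",
  count(*) as "count"
from scores s
join type_order o on o.label = s.type
group by s.user_id, s.type, o.sort_order
order by id;
```

If the order of the types in the output does not matter, the lookup table can be skipped entirely by ordering on s.type instead.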

Then, lastly, this entire query can be captured in a view using CREATE VIEW score_ranks AS, followed by the query.
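
One caveat: a plain view re-runs the expensive query every time it is read. Since the question talks about storing the result, a materialized view may be closer to what is wanted. The following is only a sketch under assumptions: PostgreSQL 9.4 or later, the per-type ranking only, a top-10 cut-off per type, and an illustrative index name.

```sql
-- Stores the query result on disk until the next refresh.
create materialized view score_ranks as
select user_id, type, rk as "rank", cnt as "count"
from (
  select user_id, type,
    rank() over(partition by type order by count(*) desc) as rk,
    count(*) as cnt
  from scores
  group by user_id, type
) ranked
where rk <= 10;  -- keep the top 10 per type (ties may yield more than 10 rows)

-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
create unique index score_ranks_user_type_idx
  on score_ranks (user_id, type);

-- Run periodically (for example from a scheduler) to recompute the counts
-- without blocking readers.
refresh materialized view concurrently score_ranks;
```

The "Weekly" rows could be added to the inner query with UNION ALL in the same way as in the queries above.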

7 Comments

Thanks for the answer and the detailed explanation! Question: is it possible to not have to specify the diamond/clay? There are quite a lot of types, so manually adding them would be hard.
I simplified the example a bit, but the best approach would be to have another table that holds an integer value like order, that ranks your labels in the correct order. E.g. (label, order) VALUES ('Diamond', 1), ('Gold', 2), ... ('Clay', 6), etc. You can then join that table by label name and take its order field as input for sorting.
I see. TheImpaler said something about using the RANK() function for the ranking, any ideas on how that could be combined with this solution?
I'm not entirely sure what he had in mind, but it's another function that can be used on a window (the OVER (...) part). It counts differently: rows that tie on the ORDER BY expression share the same number, and the numbering then skips ahead to make up for it (1, 1, 3), whereas row_number always increments by one. You can get similar results as with row_number, or it could be used as 'preparation', useful input for a followup query.
row_number counts the rows and numbers them. The window after it tells when it should restart numbering (the partition part) and in which order it should increment (first by type, then by count in descending order). I removed the PARTITION BY TRUE part as it's redundant when it contains an ORDER BY clause. Partitioning by TRUE is a simple way of saying there's only one partition (i.e. don't reset the counter of row_number).
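
For anyone comparing the two functions, here is a small self-contained sketch that runs without any table (the user_id and cnt values are made up for illustration). It puts row_number, rank and dense_rank side by side on data that contains a tie:

```sql
select user_id, cnt,
  row_number() over(order by cnt desc, user_id) as rn,   -- always 1, 2, 3, ...
  rank()       over(order by cnt desc)          as rk,   -- ties share a number, then a gap
  dense_rank() over(order by cnt desc)          as drk   -- ties share a number, no gap
from (values (1, 5), (2, 3), (3, 3), (4, 1)) as t(user_id, cnt)
order by cnt desc, user_id;

-- user_id | cnt | rn | rk | drk
--       1 |   5 |  1 |  1 |   1
--       2 |   3 |  2 |  2 |   2
--       3 |   3 |  3 |  2 |   2
--       4 |   1 |  4 |  4 |   3
```

For a leaderboard where two users with the same count should share a place, rank or dense_rank is usually the better fit; row_number is useful when every row needs a distinct number.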