275

I need to take the first N rows for each group, ordered by custom column.

Given the following table:

db=# SELECT * FROM xxx;
 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  3 |          1 | C
  4 |          1 | D
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
  8 |          2 | H
(8 rows)

I need the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:

 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
(5 rows)

I am using PostgreSQL 8.3.5.

6 Answers 6

401

New solution (PostgreSQL 8.4)

SELECT
  * 
FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
    t.*
  FROM
    xxx t) x
WHERE
  x.r <= 2;
Sign up to request clarification or add additional context in comments.

4 Comments

This works with PostgreSQL 8.4 too (window functions start with 8.4).
Awesome! It works flawlessly. I am curious though, is there a way to do this with group by?
For those who works with like millions rows and seeks for really performant way to do this - poshest's answer is the way to go. Just dont forget to spice ti up with proper indexing.
This works in mySQL 8.0.24, fast condensation of 6M rows of associated email addresses to five in each category (no sorting or criteria, just needed five email addresses. A small number of categories (companies) had several thousand addresses.)
99

Since v9.3 you can do a lateral join

select distinct t_outer.section_id, t_top.id, t_top.name from t t_outer
join lateral (
    select * from t t_inner
    where t_inner.section_id = t_outer.section_id
    order by t_inner.name
    limit 2
) t_top on true
order by t_outer.section_id;

It might be faster but, of course, you should test performance specifically on your data and use case.

9 Comments

Very cryptic solution IMO, specially with those names, but a good one.
This solution with LATERAL JOIN might be significantly faster than above one with windowed function (in some cases) if you have index by t_inner.name column
The query is easier to understand if it does not contain the self-join. In that case distinct is not needed. An example is shown in the link poshest posted.
Dude, this is mindlowing. 120ms instead of 9sec yielded with "ROW_NUMBER" solution. Thank you!
Running a variant of this lateral join on a table with 2 million rows put my DB's CPU at 100% for 12 hours. YMMV.
|
22

A lateral join is the way to go, but you should do a nested query first to improve performance on large tables.

SELECT t_limited.*
FROM (
        SELECT DISTINCT section_id
        FROM t
    ) t_groups
    JOIN LATERAL (
        SELECT *
        FROM t t_all
        WHERE t_all.section_id = t_groups.section_id
        ORDER BY t_all.name
        LIMIT 2
    ) t_limited ON true

Without the nested select distinct, the join lateral runs for every line in the table, even though the section_id is often duplicated. With the nested select distinct, the join lateral runs once and only once for each distinct section_id.

2 Comments

I tried this out, and it is much faster than other solutions. I'm discarding my previous solution based on Window Functions in favour to this one
Note that Postgres is very bad at optimizing SELECT DISTINCT, that will usually result in a full table scan. So only do that if the table is small, else try to get the list of section_ids in another way.
12

Here's another solution (PostgreSQL <= 8.3).

SELECT
  *
FROM
  xxx a
WHERE (
  SELECT
    COUNT(*)
  FROM
    xxx
  WHERE
    section_id = a.section_id
  AND
    name <= a.name
) <= 2

Comments

2
SELECT  x.*
FROM    (
        SELECT  section_id,
                COALESCE
                (
                (
                SELECT  xi
                FROM    xxx xi
                WHERE   xi.section_id = xo.section_id
                ORDER BY
                        name, id
                OFFSET 1 LIMIT 1
                ),
                (
                SELECT  xi
                FROM    xxx xi
                WHERE   xi.section_id = xo.section_id
                ORDER BY 
                        name DESC, id DESC
                LIMIT 1
                )
                ) AS mlast
        FROM    (
                SELECT  DISTINCT section_id
                FROM    xxx
                ) xo
        ) xoo
JOIN    xxx x
ON      x.section_id = xoo.section_id
        AND (x.name, x.id) <= ((mlast).name, (mlast).id)

4 Comments

The query is very close to the one I need, except that it is not showing sections with less than 2 rows, i.e. the row with ID=7 isn't returned. Otherwise I like your approach.
Thank you, I just came to the same solution with COALESCE, but you were faster. :-)
Actually the last JOIN sub-clause could be simplified to: ... AND x.id <= (mlast).id as the ID have already been chosen according to the name field, no?
@Kouber: in your example the name's and id's are sorted in same order, so you won't see it. Make the names in reverse order and you will see that these queries yield different results.
2
        -- ranking without WINDOW functions
-- EXPLAIN ANALYZE
WITH rnk AS (
        SELECT x1.id
        , COUNT(x2.id) AS rnk
        FROM xxx x1
        LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
        GROUP BY x1.id
        )
SELECT this.*
FROM xxx this
JOIN rnk ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
        ;

        -- The same without using a CTE
-- EXPLAIN ANALYZE
SELECT this.*
FROM xxx this
JOIN ( SELECT x1.id
        , COUNT(x2.id) AS rnk
        FROM xxx x1
        LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
        GROUP BY x1.id
        ) rnk
ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
        ;

6 Comments

CTEs and Window functions were introduced with the same version, so I don't see the benefit of the first solution.
The post is three years old. Besides, there may still be implementations that lack them (nudge nudge say no more). It could also be considered an exercise in old-fashoned querybuilding. (though CTEs are not very old-fashoned)
The post is tagged "postgresql" and the PostgreSQL version that introduced CTEs also introduced windowing functions. Hence my comment (I did see it's that old - and PG 8.3 did have neither)
The post mentions 8.3.5, and I believe they were introduced in 8.4. Besides: it is also good to know about alternative scenarios, IMHO.
That's exactly what I mean: 8.3 neither had CTEs nor windowing functions. So the first solution won't work on 8.3
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.