
I have a big SQL query with many computed columns in the SELECT list. The query also orders by one of those computed columns and limits the output to 100 rows. But Postgres calculates all the columns for every row, not only for the 100 rows that are returned.

Let me explain with an example.

Let's create a test table:

CREATE TABLE test_main(col1 INTEGER);

And fill it with some random data:

DO
$do$
BEGIN
  FOR r IN 1..100000 LOOP
    INSERT INTO test_main(col1) VALUES (trunc(random()*1000));
  END LOOP;
END
$do$;

Then create some additional tables:

CREATE TABLE test_main_agg1(
  col1 INTEGER,
  val INTEGER
);
CREATE TABLE test_main_agg2(
  col1 INTEGER,
  val INTEGER
);

And fill them too:

DO
$do$
DECLARE
 r test_main%rowtype;
BEGIN
  FOR r IN SELECT * FROM test_main LOOP
    FOR i IN 1..5 LOOP
      INSERT INTO test_main_agg1(col1, val) VALUES (r.col1, trunc(random()*1000));
      INSERT INTO test_main_agg2(col1, val) VALUES (r.col1, trunc(random()*1000));
    END LOOP;
  END LOOP;
END
$do$;

And, of course, create some indexes:

CREATE INDEX test_main_indx ON test_main(col1);
CREATE INDEX test_main_agg1_val_indx ON test_main_agg1(col1,val);
CREATE INDEX test_main_agg2_val_indx ON test_main_agg2(col1,val);
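
As a quick sanity check, we can confirm that a single MAX lookup is answered from the index alone (42 is an arbitrary col1 value):

EXPLAIN
SELECT MAX(val) FROM test_main_agg1 WHERE col1 = 42;

The plan should show a backward index-only scan on test_main_agg1_val_indx, which is what makes each correlated subquery cheap.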

Now, if we execute this query:

SELECT col1,
       (SELECT MAX(val) FROM test_main_agg1 g WHERE g.col1=m.col1) max_val1,
       (SELECT MAX(val) FROM test_main_agg2 g WHERE g.col1=m.col1) max_val2
  FROM test_main m
 LIMIT 100;

It is very fast because of the indexes. If we add ORDER BY col1, it is still fast. But if we use ORDER BY max_val1, it takes about 2 seconds.
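
Spelled out, the statement we measure with EXPLAIN ANALYZE is the query above with the ordering added:

EXPLAIN ANALYZE
SELECT col1,
       (SELECT MAX(val) FROM test_main_agg1 g WHERE g.col1=m.col1) max_val1,
       (SELECT MAX(val) FROM test_main_agg2 g WHERE g.col1=m.col1) max_val2
  FROM test_main m
 ORDER BY max_val1
 LIMIT 100;

In its output we see these rows: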

SubPlan 4
  ->  Result  (cost=4.06..4.07 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=100000)
        InitPlan 3 (returns $3)
          ->  Limit  (cost=0.42..4.06 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=100000)
                ->  Index Only Scan Backward using test_main_agg2_val_indx on test_main_agg2 g_1  (cost=0.42..1818.25 rows=500 width=4) (actual time=0.010..0.010 rows=1 loops=100000)
                      Index Cond: ((col1 = m.col1) AND (val IS NOT NULL))
                      Heap Fetches: 100000

This means that Postgres calculates max_val2 for all 100000 rows, not only for the 100 that are returned. I understand why Postgres needs to calculate max_val1 for every row (the sort is on it), but not why it needs max_val2.

Is there some hint or similar mechanism to tell Postgres to calculate such columns only after it has applied the ordering and the limit?

  • Unrelated, but: your DO blocks can be replaced with simple INSERT statements, e.g. INSERT INTO test_main(col1) SELECT trunc(random()*1000) FROM generate_series(1,100000); (spelled out for all three tables below)
  • Thank you, I didn't know about the generate_series function. I will remember it :)
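
Spelling out that suggestion for all three tables (a sketch; the cross join with generate_series(1,5) reproduces the five rows per test_main row):

-- Equivalent of the first DO block:
INSERT INTO test_main(col1)
SELECT trunc(random()*1000) FROM generate_series(1,100000);

-- Equivalent of the second DO block, five random rows per test_main row:
INSERT INTO test_main_agg1(col1, val)
SELECT m.col1, trunc(random()*1000) FROM test_main m, generate_series(1,5);

INSERT INTO test_main_agg2(col1, val)
SELECT m.col1, trunc(random()*1000) FROM test_main m, generate_series(1,5);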

1 Answer


LIMIT limits the output of the overall query, not of subqueries inside the main query. If you only want the MAX for 100 rows, you need to select those rows first and then apply the MAX() to that subset:

SELECT col1,
       (SELECT MAX(val) FROM test_main_agg1 g WHERE g.col1=m.col1) max_val1,
       (SELECT MAX(val) FROM test_main_agg2 g WHERE g.col1=m.col1) max_val2
FROM (
  SELECT col1
  FROM test_main
  LIMIT 100
) m;

Note that LIMIT without an ORDER BY does not really make sense: rows in a relational database have no inherent order, so there is no such thing as "the first 100 rows" of a table unless you specify a sort order.
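
For a deterministic result, a sketch of the same query with an explicit sort order on the inner query (ORDER BY col1 is just one possible choice):

SELECT col1,
       (SELECT MAX(val) FROM test_main_agg1 g WHERE g.col1=m.col1) max_val1,
       (SELECT MAX(val) FROM test_main_agg2 g WHERE g.col1=m.col1) max_val2
FROM (
  SELECT col1
  FROM test_main
  ORDER BY col1
  LIMIT 100
) m;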


1 Comment

Yes, you are right. I forgot about the possibility of using ORDER BY and LIMIT in a subquery. Then I can rewrite the query from my example to this one: SELECT t.*, (SELECT ...) max_val2 FROM (SELECT col1, (SELECT ...) max_val1 FROM test_main ORDER BY max_val1 LIMIT 100) t;
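
Assembled from the subqueries in the question, that rewrite would read in full (a sketch of what the abbreviated comment describes):

SELECT t.*,
       (SELECT MAX(val) FROM test_main_agg2 g WHERE g.col1=t.col1) max_val2
  FROM (SELECT col1,
               (SELECT MAX(val) FROM test_main_agg1 g WHERE g.col1=m.col1) max_val1
          FROM test_main m
         ORDER BY max_val1
         LIMIT 100) t;

Only max_val1 has to be computed for all 100000 rows (it drives the sort); max_val2 is then computed for just the 100 surviving rows.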
