
This is my current query; it works, but it is slow:

SELECT row, min(flg) || ' to ' || max(flg) AS xyz, avg(amt_won), count(*)
FROM (
    SELECT (ROW_NUMBER() OVER (ORDER BY flg)) * 100 /
           (SELECT count(*) + 100 AS temprow FROM temporary_six_max) AS row,
           flg, amt_won
    FROM temporary_six_max
    JOIN (
        SELECT id_player AS pid, avg(flg_vpip::int) AS flg
        FROM temporary_six_max
        GROUP BY id_player
    ) AS auxtable
    ON pid = id_player
) AS auxtable2
GROUP BY 1
ORDER BY 1;

I am grouping the rows into 100 ranges of (almost) fixed size, ordered by avg(flg_vpip) grouped by id_player.
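To make that concrete, here is a toy sketch of the same bucketing arithmetic run against generate_series instead of my real table (illustration only, not part of the actual query): it splits 1,000 ordered rows into roughly equal groups.

SELECT (rn * 100) / (total + 100) AS bucket, count(*) AS rows_in_bucket
FROM (SELECT row_number() OVER (ORDER BY x) AS rn,
             count(*) OVER ()               AS total
      FROM generate_series(1, 1000) AS g(x)) t
GROUP BY 1
ORDER BY 1;
-- yields ~91 buckets of about 11 rows each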

Here I've pasted the results in case it may help to understand: https://spreadsheets0.google.com/ccc?key=tFVsxkWVn4fMWYBxxGYokwQ&authkey=CNDvuOcG&authkey=CNDvuOcG#gid=0

I wonder if there is a better function to use than ROW_NUMBER() in this case, and I feel like I am doing too many subselects, but I don't know how to optimize it.

I'd very much appreciate any help.

If something is not clear just let me know.

Thank you.

EDIT:

The reason I created auxtable2 is that when I use ROW_NUMBER() OVER (ORDER BY flg) together with other aggregate functions such as avg(amt_won) and count(*), which are essential, I get an error saying that flg should be in an aggregate function, but I can't order the window by an aggregate of flg.
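Roughly, the situation looks like this (the table and column names below are made up for illustration, not my real schema):

-- With a GROUP BY in the same query level, window functions are evaluated
-- after aggregation, so ordering the window by a non-grouped column fails:
--
--   SELECT row_number() OVER (ORDER BY flg), avg(amt_won), count(*)
--   FROM player_stats
--   GROUP BY something;
--   -- ERROR: column "flg" must appear in the GROUP BY clause
--   --        or be used in an aggregate function
--
-- Wrapping the window step in a derived table (my auxtable2) separates the
-- two levels:
SELECT bucket, avg(amt_won), count(*)
FROM (SELECT amt_won,
             row_number() OVER (ORDER BY flg) AS bucket
      FROM player_stats) numbered
GROUP BY bucket
ORDER BY bucket;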

  • Please post the output of EXPLAIN ANALYZE as well. And please explain what you are actually trying to achieve with the nested selects (I mean auxtable4, not the derived table auxtable2). Currently I don't understand the goal of that. Commented Jan 7, 2011 at 17:55
  • I posted the EXPLAIN ANALYZE in the gdocs on the link above. The explanation for the creation of auxtable4 and auxtable2 is in the edited post. Thanks. Commented Jan 7, 2011 at 18:10
  • max(row_number()) is essentially count(*) over (...). But I still don't understand the calculations you apply to it (the dividing and the +100). I do understand the intention now, though. Commented Jan 7, 2011 at 18:55
  • I multiply by 100 and then divide by the number of rows because after this I can make groups that are almost equally populated (in this case 100 groups). count(*) works better, thank you very much :) The +100 is just a trick to group correctly. Commented Jan 7, 2011 at 19:08
  • I created a post (stackoverflow.com/questions/4453321/…) which basically asked how to create equally populated groups, but I don't think it is a very popular approach, so I designed my own method, which is not fast. That is what the calculations do. Commented Jan 7, 2011 at 19:16

1 Answer


I generated some data to test with like this:

-- ~1,000,000 rows spread over ~1,000 players; flg_vpip is a random 0/1,
-- and amt_won is a random amount that is positive when flg_vpip = 0 and
-- negative otherwise.
create table temporary_six_max as
select id_player, flg_vpip,
       random()*100 * (case flg_vpip when 0 then 1 else -1 end) as amt_won
from (select (random()*1000)::int as id_player, random()::int as flg_vpip
      from generate_series(1,1000000)) source;
create index on temporary_six_max(id_player);

Your query runs successfully against that, but doesn't quite generate the same plan: I get a nested loop in the lower arm rather than a merge, and a seq scan in the init plan. You haven't turned off enable_seqscan, I hope?
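For reference, this is a standard PostgreSQL setting you can inspect (the 'on' mentioned below is just the default, not something taken from your setup):

SHOW enable_seqscan;          -- normally 'on'
-- SET enable_seqscan = on;   -- re-enable for the session if it had been turned off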

A solution just using a single scan of the table:

select row, min(flg) || ' to ' || max(flg) as xyz, avg(amt_won), count(*)
from (select flg, amt_won, ntile(100) over(order by flg) as row
      from (select id_player as pid, amt_won,
                   avg(flg_vpip::int) over (partition by id_player) as flg
            from temporary_six_max
           ) player_stats
     ) chunks
group by 1
order by 1;
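If ntile is unfamiliar, here is a quick toy check of its behaviour (my own example, separate from the solution above): it splits the ordered rows into near-equal buckets and spreads any remainder over the first ones.

SELECT bucket, count(*) AS rows_in_bucket
FROM (SELECT ntile(4) OVER (ORDER BY x) AS bucket
      FROM generate_series(1, 10) AS g(x)) t
GROUP BY bucket
ORDER BY bucket;
-- buckets of 3, 3, 2 and 2 rows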

The bad news is that this actually performs worse on my machine, especially if I bump work_mem up enough to avoid the first disk sort (building player_stats, sorted by flg). Increasing work_mem did halve the query time though, so I guess that is at least a start?
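For what it's worth, a session-level bump looks like this (the 256MB figure is purely illustrative, not a recommendation):

SET work_mem = '256MB';   -- illustrative value only
-- run the query here
RESET work_mem;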

Having said that, my queries take about 5 seconds to process the 10E6 input rows in temporary_six_max, which is an order of magnitude faster than the times you posted. Does your table fit into your buffer cache? If not, a single-scan solution may be much better for you. (Which version of PostgreSQL are you using? "explain (analyze on, buffers on) select..." will show you buffer hit/miss rates in 9.0; otherwise just look at your "shared_buffers" setting and compare it with the table size.)
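Concretely, those checks would look something like this (substitute your real query for the placeholder count; pg_size_pretty and pg_total_relation_size are standard PostgreSQL functions):

EXPLAIN (ANALYZE ON, BUFFERS ON)
SELECT count(*) FROM temporary_six_max;   -- placeholder; use the real query here

SHOW shared_buffers;
SELECT pg_size_pretty(pg_total_relation_size('temporary_six_max')) AS table_size;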
