
I'm currently working on a project that tracks users and their actions in my database (PostgreSQL as the RDBMS), and I have run into an issue when trying to perform COUNT(*) on occurrences of each user. What I want is to be able to efficiently count the number of times each user appears across all records, and also to look at those counts over a particular date range.

So, the problem is: how do we count the total number of times a user appears in the table's contents, and how do we count that total over a date range?

What I've tried

As you might know, Postgres doesn't support COUNT(*) very well using indexes, so we have to consider other ways to reduce the number of records it looks at in order to speed up the query. My first approach is to create a table that keeps track of the number of times a user has a log message associated with them, and on what day (similar to the idea behind a materialized view, but I don't want to continually refresh a materialized view with my count query). Here is what I've come up with:

CREATE TABLE users_counts("user" varchar(65536), counter int DEFAULT 0, day date);

CREATE RULE inc_user_date_count
AS ON INSERT TO main_table
DO ALSO UPDATE users_counts SET counter = counter + 1
WHERE "user" = NEW."user" AND day = NEW.date_::date;

What this does is, every time a new record is inserted into main_table, update users_counts, incrementing the counter on the row whose day equals the new record's date and whose user name matches. (Note that user has to be quoted because it is a reserved word.)

NOTE: the date_ column in main_table is a timestamp, so I must cast the new record's date_ to the DATE type.

The problem is: if the user column value doesn't already exist in users_counts for the current day, then nothing is updated.

Here is my question:

How do I write the rule so that it checks whether a row exists for the user on the current day; if so, increment that counter, otherwise insert a new row with the user, the day, and a counter of 1?
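For concreteness, the shape I'm after is something like the following PL/pgSQL sketch, using a trigger function instead of rules (names as above, "user" quoted because it is reserved; this ignores races between concurrent inserts, so treat it as an outline, not a settled solution):

```sql
-- Sketch: one trigger function doing the update-or-insert for the
-- counter table from the question. Assumes main_table has "user"
-- and date_ columns as described above.
CREATE OR REPLACE FUNCTION bump_user_count() RETURNS trigger AS $$
BEGIN
    UPDATE users_counts
       SET counter = counter + 1
     WHERE "user" = NEW."user"
       AND day = NEW.date_::date;
    IF NOT FOUND THEN
        -- No row yet for this user on this day; create it.
        INSERT INTO users_counts("user", counter, day)
        VALUES (NEW."user", 1, NEW.date_::date);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER bump_user_count_trg
AFTER INSERT ON main_table
FOR EACH ROW EXECUTE PROCEDURE bump_user_count();
```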

I would also like to know whether my approach makes sense, or if there are any ideas I'm missing that I just haven't thought about. As my database grows, counting becomes increasingly inefficient, so I want to avoid any performance bottlenecks.

EDIT 1: I was actually able to figure this out by creating a separate RULE, but I'm not sure if this is correct:

CREATE RULE test_insert AS ON INSERT TO main_table
DO ALSO INSERT INTO users_counts("user", counter, day)
SELECT NEW."user", 1, NEW.date_::date
WHERE NOT EXISTS (SELECT 1 FROM users_counts
                  WHERE "user" = NEW."user" AND day = NEW.date_::date);

Basically, the insert happens only if a row for that user and day doesn't already exist in my cached users_counts table, and the first rule above updates the count.

What I'm unsure of is how I know which rule is called first, the update rule or the insert one. And there must be a better way; how do I combine the two rules? Can this be done with a function?

1 Answer

It is true that PostgreSQL is notoriously slow with count(*) queries. However, if you have a WHERE clause that limits the number of entries, the query will be much faster. If you are using PostgreSQL 9.2 or newer, such a query can be just as fast as it is in MySQL because of index-only scans, which were added in 9.2, but it's best to EXPLAIN ANALYZE your query to make sure.
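As a rough illustration of that check (the index name and date range here are made up; the column names come from the question):

```sql
-- Hypothetical index on the timestamp column; with it, 9.2+ can
-- often answer a date-range count with an index-only scan.
CREATE INDEX main_table_date_idx ON main_table (date_);

-- Inspect the plan: look for "Index Only Scan" in the output.
-- The range below is just an example.
EXPLAIN ANALYZE
SELECT count(*)
FROM main_table
WHERE date_ >= '2013-01-01' AND date_ < '2013-02-01';
```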

Does my solution make sense?

Very much so, provided that your EXPLAIN ANALYZE shows that index-only scans are not being used. Trigger-based solutions like the one you have adopted find wide usage. But, as you have realized, the problem of the initial state arises (whether to do an UPDATE or an INSERT).

which rule is called first

Multiple rules on the same table and same event type are applied in alphabetical name order.

from http://www.postgresql.org/docs/9.1/static/sql-createrule.html — the same applies for triggers. If you want a particular rule to be executed first, change its name so that it comes earlier in alphabetical order.

how do I combine the two rules?

One solution is to modify your rule to perform an upsert (look right at the bottom of that page for a sample upsert). The other is to populate the counter table with initial values; the trick is to create the trigger at the same time to avoid errors. This blog post explains it really well.
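Adapted loosely to your counter table, the documentation's sample upsert looks something like this (a sketch only; it assumes you add a unique constraint on ("user", day), which the question's schema does not yet have):

```sql
-- Prerequisite (assumed): a unique constraint so concurrent inserts
-- raise unique_violation instead of creating duplicate rows.
-- ALTER TABLE users_counts ADD CONSTRAINT users_counts_uniq
--     UNIQUE ("user", day);

CREATE OR REPLACE FUNCTION upsert_user_count(u varchar, d date)
RETURNS void AS $$
BEGIN
    LOOP
        -- First try to increment an existing row.
        UPDATE users_counts SET counter = counter + 1
         WHERE "user" = u AND day = d;
        IF FOUND THEN
            RETURN;
        END IF;
        -- No row yet: try to insert one. If another transaction
        -- inserted it concurrently, retry the UPDATE.
        BEGIN
            INSERT INTO users_counts("user", counter, day)
            VALUES (u, 1, d);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- fall through and loop again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```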

While the initial setup will be slow, each individual insert will probably be faster. The two opposing factors are the slowness of a WHERE NOT EXISTS query vs. the overhead of catching an exception.

Tip: A block containing an EXCEPTION clause is significantly more expensive to enter and exit than a block without one. Therefore, don't use EXCEPTION without need.

Source: the PostgreSQL documentation page linked above.


5 Comments

Very nice answer, I appreciate it. I did see that with proper indexes, COUNT(*) is faster with an index-only scan; the problem is that one of my use cases is looking at a date range and returning the total number of users in it. Suppose we are looking at a month with 1 million records: the count operation is still slow when the only predicates I need are on the date, and the indexes help but aren't fast enough.
And although very 'hacky', I guess I can purposefully alphabetize the rules so that we try the update first, then the insert, since inserting first followed by the update would lead to a double increment. But I do like the 'upsert' function you showed better :)
Glad to have contributed. I guess under the circumstances you have no choice but to use the trigger/rule-based approach then. You could perhaps name your rules increment_counter and initialize_counter and still keep the rule names meaningful.
One thing you also got me to notice is that once more filtering is applied, my count(*) queries are much quicker than when filtering on date alone. This is because my database consists of logs. Does it make sense to use my 'cached' table for queries that have no filtering other than date, and my normal queries when enough filtering is applied? This is easily determinable within my project, so it is something I can do.
If you mean to count the total number of rows without any filtering, and a result that is a very good estimate is acceptable, you can do: select reltuples from pg_class where relname = 'your_table_name' (note: this doesn't work when partitions are in use and doesn't work on views)
