Update: I’ve found a solution for my case (my answer below) that involves being judicious about when `IN` is used, but there may be more generally useful advice yet to be had.
I’m sure an answer to this question exists in plenty of places, but I’m having trouble finding it: my situation is slightly more complicated than anything discussed in the Postgres documentation, yet much simpler than the questions I’ve found on here that involve multiple tables or subqueries and are answered with elaborate plans of attack. So I don’t mind being pointed to one of those existing answers that I’ve failed to find, so long as it actually helps in my situation.
Here’s an example of a query that’s causing me trouble:
SELECT trees.id
FROM "trees"
WHERE "trees"."trashed" = 'f'
  AND trees.chapter_id IN (1, 8, 9, 12, 18, 11, 6, 10, 5, 2, 4, 7, 16, 15, 17, 3, 14, 13)
ORDER BY LOWER(trees.shortcode);
This is generated by ActiveRecord in my Rails application. Maybe I could rephrase the query to be more optimal somehow, but this result set (the IDs of all trees, in a textual order, filtered by "trashed" and belonging to a subset of "chapters") is something I currently need for a big paginated list of trees in the interface. (The subset of chapters is determined by the user-permission system, so this query has to be invoked at least once whenever a user starts looking at the list.)
In my local version, there are about 67,000 trees in this table, and there will only ever be more in production.
Here’s the query plan given by EXPLAIN:
Sort  (cost=9406.85..9543.34 rows=54595 width=17)
  Sort Key: (lower((shortcode)::text))
  ->  Seq Scan on trees  (cost=0.00..3991.18 rows=54595 width=17)
        Filter: ((NOT trashed) AND (chapter_id = ANY ('{1,8,9,12,18,11,6,10,5,2,4,7,16,15,17,3,14,13}'::integer[])))
This becomes much faster if I remove the ORDER BY, obviously, but again, I need this list of IDs in a specific order to display even a single page of the list. Locally, this query executes in about 2-3 seconds, which is far too long, and I’ve generally found that the production database on Heroku takes as long as or longer than my local one.
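If actual runtimes and buffer statistics (rather than the planner’s estimates) would help, I can run something like the following and post the output; it’s the same query as above, and since ANALYZE actually executes the query it takes the same 2-3 seconds:

```sql
-- Re-run the query, reporting real timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT trees.id
FROM trees
WHERE NOT trees.trashed
  AND trees.chapter_id IN (1, 8, 9, 12, 18, 11, 6, 10, 5, 2, 4, 7, 16, 15, 17, 3, 14, 13)
ORDER BY LOWER(trees.shortcode);
```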
There are individual (btree) indices on trees.trashed, trees.chapter_id, and LOWER(trees.shortcode). I experimented with adding a multi-column index on trashed and chapter_id, but predictably that didn’t help, because the filter isn’t the slow part of this query. I don’t know enough about Postgres or SQL to have an idea of where to go from here, which is why I’m asking for help. (I’d like to learn more, so any pointers to sections of the documentation that would give me a better sense of the kinds of things to investigate would be greatly appreciated as well.)
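For reference, the multi-column index experiment was roughly the following (the index name is my own invention):

```sql
-- The two-column index I tried; it didn't help, presumably because
-- the filtering isn't the expensive part of the plan -- the sort is.
CREATE INDEX index_trees_on_trashed_and_chapter_id
    ON trees (trashed, chapter_id);
```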
The list of chapters is never going to get much longer than this, so maybe it would be faster to filter on each chapter individually? But there are similar queries elsewhere in the application, so I would rather learn a general way to improve this kind of thing.
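By "filter on each individually" I mean something roughly like this sketch; I don’t know whether Postgres would actually plan it any better than the IN version:

```sql
-- One branch per permitted chapter. The sort key has to be a selected
-- column for the ORDER BY to apply across the whole UNION.
SELECT id, LOWER(shortcode) AS sort_key
  FROM trees WHERE NOT trashed AND chapter_id = 1
UNION ALL
SELECT id, LOWER(shortcode)
  FROM trees WHERE NOT trashed AND chapter_id = 8
-- ... and so on for the remaining chapters ...
ORDER BY sort_key;
```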
I may have forgotten to add some important information while writing this, so if there’s something that seems obviously wrong, please comment and I’ll try to clarify.
Update: Here’s the description of the trees table, as requested by a commenter.
Table "public.trees"
Column | Type | Modifiers
-------------------+-----------------------------+----------------------------------------------------
id | integer | not null default nextval('trees_id_seq'::regclass)
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
shortcode | character varying(255) |
cross_id | integer |
chapter_id | integer |
name | character varying(255) |
classification | character varying(255) |
tag | character varying(255) |
alive | boolean | not null default true
latitude | numeric(14,10) |
longitude | numeric(14,10) |
city | character varying(255) |
county | character varying(255) |
state | character varying(255) |
comments | text |
trashed | boolean | not null default false
created_by_id | integer |
death_date | date |
planted_as | character varying(255) | not null default 'seed'::character varying
wild | boolean | not null default false
submitted_by_id | integer |
owned_by_id | integer |
steward_id | integer |
planting_id | integer |
planting_cross_id | integer |
Indexes:
"trees_pkey" PRIMARY KEY, btree (id)
"index_trees_on_chapter_id" btree (chapter_id)
"index_trees_on_created_by_id" btree (created_by_id)
"index_trees_on_cross_id" btree (cross_id)
"index_trees_on_trashed" btree (trashed)
"trees_lower_classification_idx" btree (lower(classification::text))
"trees_lower_name_idx" btree (lower(name::text))
"trees_lower_shortcode_idx" btree (lower(shortcode::text))
"trees_lower_tag_idx" btree (lower(tag::text))
My local trees table has 67406 rows, and there will be more in production.