Update: I’ve found a solution for my case (my answer below) that involves being judicious about when `IN` is used, but there may be more generally useful advice yet to be had.
I’m sure an answer to this question exists in plenty of places, but I’m having trouble finding it: my situation is slightly more complicated than anything discussed in the Postgres documentation, yet much simpler than the questions I’ve found on here that involve multiple tables or subqueries and are answered with elaborate plans of attack. So I don’t mind being pointed to one of those existing answers that I’ve failed to find, so long as it actually helps in my situation.
Here’s an example of a query that’s causing me trouble:
SELECT trees.id
FROM "trees"
WHERE "trees"."trashed" = 'f'
  AND trees.chapter_id IN (1, 8, 9, 12, 18, 11, 6, 10, 5, 2, 4, 7, 16, 15, 17, 3, 14, 13)
ORDER BY LOWER(trees.shortcode);
This is generated by ActiveRecord in my Rails application. Maybe I could rephrase the query to be more optimal somehow, but this result set (the IDs of all trees, in a textual order, filtered by "trashed" and belonging to a subset of "chapters") is something I currently need for a big paginated list of trees in the interface. (The subset of chapters is determined by the user-permission system, so this query has to be invoked at least once whenever a user starts looking at the list.)
In my local version, there are about 67,000 trees in this table, and there will only ever be more in production.
Here’s the query plan given by EXPLAIN:
Sort  (cost=9406.85..9543.34 rows=54595 width=17)
  Sort Key: (lower((shortcode)::text))
  ->  Seq Scan on trees  (cost=0.00..3991.18 rows=54595 width=17)
        Filter: ((NOT trashed) AND (chapter_id = ANY ('{1,8,9,12,18,11,6,10,5,2,4,7,16,15,17,3,14,13}'::integer[])))
This becomes much faster if I remove the ORDER BY, obviously, but again, I need this list of IDs in a specific order to display even a single page of the list. Locally, this query executes in about 2-3 seconds, which is far too long, and I’ve generally found that the production database on Heroku takes as long as or longer than my local one.
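If actual runtimes and buffer statistics (rather than the planner’s estimates) would help, I can run something like the following and post the output; it’s the same query as above, and since ANALYZE actually executes the query it takes the same 2-3 seconds:

```sql
-- Re-run the query, reporting real timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT trees.id
FROM trees
WHERE NOT trees.trashed
  AND trees.chapter_id IN (1, 8, 9, 12, 18, 11, 6, 10, 5, 2, 4, 7, 16, 15, 17, 3, 14, 13)
ORDER BY LOWER(trees.shortcode);
```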
There are individual (btree) indices on trees.trashed, trees.chapter_id, and LOWER(trees.shortcode). I experimented with adding a multi-column index on trashed and chapter_id, but predictably that didn’t help, because the filter isn’t the slow part of this query. I don’t know enough about Postgres or SQL to have an idea of where to go from here, which is why I’m asking for help. (I’d like to learn more, so any pointers to sections of the documentation that would give me a better sense of the kinds of things to investigate would be greatly appreciated as well.)
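For reference, the multi-column index experiment was roughly the following (the index name is my own invention):

```sql
-- The two-column index I tried; it didn't help, presumably because
-- the filtering isn't the expensive part of the plan -- the sort is.
CREATE INDEX index_trees_on_trashed_and_chapter_id
    ON trees (trashed, chapter_id);
```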
The list of chapters is never going to get much longer than this, so maybe it would be faster to filter on each chapter individually? But there are similar queries elsewhere in the application, so I would rather learn a general way to improve this kind of thing.
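By "filter on each individually" I mean something roughly like this sketch; I don’t know whether Postgres would actually plan it any better than the IN version:

```sql
-- One branch per permitted chapter. The sort key has to be a selected
-- column for the ORDER BY to apply across the whole UNION.
SELECT id, LOWER(shortcode) AS sort_key
  FROM trees WHERE NOT trashed AND chapter_id = 1
UNION ALL
SELECT id, LOWER(shortcode)
  FROM trees WHERE NOT trashed AND chapter_id = 8
-- ... and so on for the remaining chapters ...
ORDER BY sort_key;
```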
I may have forgotten to add some important information while writing this, so if there’s something that seems obviously wrong, please comment and I’ll try to clarify.
Update: Here’s the description of the trees table, as requested by a commenter.
Table "public.trees"
Column | Type | Modifiers
-------------------+-----------------------------+----------------------------------------------------
id | integer | not null default nextval('trees_id_seq'::regclass)
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
shortcode | character varying(255) |
cross_id | integer |
chapter_id | integer |
name | character varying(255) |
classification | character varying(255) |
tag | character varying(255) |
alive | boolean | not null default true
latitude | numeric(14,10) |
longitude | numeric(14,10) |
city | character varying(255) |
county | character varying(255) |
state | character varying(255) |
comments | text |
trashed | boolean | not null default false
created_by_id | integer |
death_date | date |
planted_as | character varying(255) | not null default 'seed'::character varying
wild | boolean | not null default false
submitted_by_id | integer |
owned_by_id | integer |
steward_id | integer |
planting_id | integer |
planting_cross_id | integer |
Indexes:
"trees_pkey" PRIMARY KEY, btree (id)
"index_trees_on_chapter_id" btree (chapter_id)
"index_trees_on_created_by_id" btree (created_by_id)
"index_trees_on_cross_id" btree (cross_id)
"index_trees_on_trashed" btree (trashed)
"trees_lower_classification_idx" btree (lower(classification::text))
"trees_lower_name_idx" btree (lower(name::text))
"trees_lower_shortcode_idx" btree (lower(shortcode::text))
"trees_lower_tag_idx" btree (lower(tag::text))
My local trees table has 67406 rows, and there will be more in production.