I have a fairly large table (500K - 1M rows) in PostgreSQL 9.0 that contains generic "time slice" information, that is, it determines when a row in another table (a "feature") is valid. The definition looks like this (slightly simplified):
CREATE TABLE feature_timeslice
(
timeslice_id int NOT NULL,
feature_id int NOT NULL,
valid_time_begin timestamp NOT NULL,
valid_time_end timestamp,
sequence_number smallint,
-- Some other columns
CONSTRAINT pk_feature_timeslice PRIMARY KEY (timeslice_id)
-- Some other constraints
)
CREATE INDEX ix_feature_timeslice_feature_id
ON feature_timeslice USING btree (feature_id);
Many other tables for specific features are then joined to it on timeslice_id:
CREATE TABLE specific_feature_timeslice
(
timeslice_id int NOT NULL,
-- Other columns
CONSTRAINT pk_specific_feature_timeslice PRIMARY KEY (timeslice_id),
CONSTRAINT fk_specific_feature_timeslice_feature_timeslice FOREIGN KEY (timeslice_id) REFERENCES feature_timeslice (timeslice_id)
)
There may be multiple time slices with overlapping valid periods (begin/end time), but the one with the highest sequence_number takes priority (again, a slight simplification, but close enough). I'd like to efficiently find the currently valid row for each feature_id, so I have a view defined, like this:
CREATE VIEW feature_timeslice_id_now
AS
SELECT timeslice_id
FROM
(
SELECT timeslice_id, rank() OVER
(
PARTITION BY feature_id
ORDER BY sequence_number DESC, timeslice_id DESC
)
FROM feature_timeslice
WHERE (current_timestamp AT TIME ZONE 'UTC', '0'::interval) OVERLAPS (valid_time_begin, COALESCE(valid_time_end, 'infinity'::timestamp))
) subq
WHERE subq.rank = 1
It's typically queried like this:
SELECT *
FROM specific_feature_timeslice sf
JOIN feature_timeslice_id_now n USING (timeslice_id)
WHERE sf.name = 'SOMETHING'
This works, but it's still a bit too slow - takes 1-2 seconds, even though there may only be 1-5 rows returned, because the specific_feature_timeslice criteria generally narrows it down a lot. (More complex queries that join multiple feature views get very slow very quickly.) I can't figure out how to get PostgreSQL to do this more efficiently. The query plan looks like this:
Join Filter: ((r.timeslice_id)::integer = (subq.timeslice_id)::integer)
-> Subquery Scan on subq (cost=32034.36..37876.98 rows=835 width=4) (actual time=2086.125..5243.467 rows=250918 loops=1)
Filter: (subq.rank = 1)
-> WindowAgg (cost=32034.36..35790.33 rows=166932 width=10) (actual time=2086.110..4066.351 rows=250918 loops=1)
-> Sort (cost=32034.36..32451.69 rows=166932 width=10) (actual time=2086.065..2654.971 rows=250918 loops=1)
Sort Key: feature_timeslice.feature_id, feature_timeslice.sequence_number, feature_timeslice.timeslice_id
Sort Method: quicksort Memory: 13898kB
-> Seq Scan on feature_timeslice (cost=0.00..17553.93 rows=166932 width=10) (actual time=287.270..1225.595 rows=250918 loops=1)
Filter: overlaps(timezone('UTC'::text, now()), (timezone('UTC'::text, now()) + '00:00:00'::interval), (valid_time_begin)::timestamp without time zone, COALESCE((valid_time_end)::timestamp without time zone, 'infinity'::timestamp without time zone))
-> Materialize (cost=0.00..1093.85 rows=2 width=139) (actual time=0.002..0.007 rows=2 loops=250918)
-> Seq Scan on specific_feature_timeslice sf (cost=0.00..1093.84 rows=2 width=139) (actual time=1.958..7.674 rows=2 loops=1)
Filter: ((name)::text = 'SOMETHING'::text)
Total runtime: 10319.875 ms
In reality, I'd like to do this query for any given time, not just the current time. I have a function defined for that, which takes the time as an argument, but querying for "now" is the most common scenario, so even if I could only speed that up it would be a great improvement.
== Edit ==
OK, I've tried normalising the table as suggested by both answers - that is, I moved valid_time_begin and valid_time_end into a separate table, time_period. I also replaced the window function with WHERE NOT EXISTS ([better candidate time slice]). In the process I also upgraded to PostgreSQL 9.1. With all that some queries are now twice as fast. The query plan looks the same as in wildplasser's answer. This is good, but not quite as good as I'd hoped - it still takes over a second to select from a single feature table.
Ideally, I'd like to take advantage of the selectivity of the feature WHERE condition, as Erwin Brandstetter says. If I hand-craft a query to do this the time I get is 15-30 ms. Now that's more like it! The hand-crafted query looks something like this:
WITH filtered_feature AS
(
SELECT *
FROM specific_feature_timeslice sf
JOIN feature_timeslice ft USING (timeslice_id)
WHERE sf.name = 'SOMETHING'
)
SELECT *
FROM filtered_feature ff
JOIN
(
SELECT timeslice_id
FROM filtered_feature candidate
JOIN time_period candidate_time ON candidate.valid_time_period_id = candidate_time.id
WHERE ('2011-09-26', '0'::interval) OVERLAPS (candidate_time.valid_time_begin, COALESCE(candidate_time.valid_time_end, 'infinity'::timestamp))
AND NOT EXISTS
(
SELECT *
FROM filtered_feature better
JOIN time_period better_time ON better.valid_time_period_id = better_time.id
WHERE ('2011-09-26', '0'::interval) OVERLAPS (better_time.valid_time_begin, COALESCE(better_time.valid_time_end, 'infinity'::timestamp))
AND better.feature_id = candidate.feature_id AND better.timeslice_id != candidate.timeslice_id
AND better.sequence_number > candidate.sequence_number
)
) AS ft ON ff.timeslice_id = ft.timeslice_id
Unfortunately, this is way too big and complex to use in normal queries, which might join many other tables. I need some way to encapsulate this logic in a function (for arbitrary time) or at least a view (for current time), but I cannot figure out how to do this while still getting the query planner to filter on the specific feature first. If only I could pass a rowset into a function - but as far as I know PostgreSQL doesn't allow this. Any ideas?
== Conclusion ==
I ended up using PostgreSQL inheritance to solve this (see my answer), but I would not have come up with this idea if it wasn't for Erwin Brandstetter answer, so the bounty goes to him. wildplasser's answer was also very helpful, because it allowed me to eliminate the unnecessary window function, which sped it up further. Many thanks to both of you!
feature_idmight speed up the sorting for the rank() evaluation