I have a PostgreSQL table with millions of rows (timestamp, event, distinct_id, properties) where I want to

- filter on event (low cardinality) and timestamp,
- and group by date_trunc('week', timestamp) and distinct_id (high cardinality).
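For reference, the table is roughly shaped like this (a simplified sketch; the exact column types may differ slightly):

CREATE TABLE telemetry_events (
    "timestamp"  timestamptz NOT NULL,  -- event time, always filtered by range
    event        text NOT NULL,         -- low cardinality (a handful of event names)
    distinct_id  text NOT NULL,         -- high cardinality (one value per user/device)
    properties   jsonb                  -- free-form event payload
);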
I have already squeezed good performance out of the plain COUNT(*) ... GROUP BY week queries, specifically this one:
SELECT
date_trunc('week', timestamp) AS "timestamp"
, count(*) AS "count"
FROM telemetry_events
WHERE (
(event = 'view load' OR event = 'view:loaded')
AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY
date_trunc('week', timestamp)
ORDER BY
date_trunc('week', timestamp) ASC
;
by creating two indexes:
CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");
The current execution time is under 7 seconds, which I find excellent (at least for now). This is the EXPLAIN ANALYZE:
GroupAggregate (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
Group Key: (date_trunc('week'::text, "timestamp"))
-> Sort (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
Sort Key: (date_trunc('week'::text, "timestamp"))
Sort Method: quicksort Memory: 8189kB
-> Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
Rows Removed by Filter: 639794
Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms
However, if I add my high-cardinality distinct_id to the GROUP BY, the queries are much slower (45 to 70 seconds):
SELECT
date_trunc('week', timestamp) AS "timestamp"
, distinct_id
, count(*) AS "count"
FROM telemetry_events
WHERE (
(event = 'view load' OR event = 'view:loaded')
AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (
date_trunc('week', timestamp)
, distinct_id
)
ORDER BY
date_trunc('week', timestamp) ASC
;
This is the EXPLAIN ANALYZE:
Finalize GroupAggregate (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
-> Gather Merge (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
-> Sort (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
Sort Method: quicksort Memory: 4520kB
Worker 0: Sort Method: quicksort Memory: 4507kB
Worker 1: Sort Method: quicksort Memory: 4679kB
-> Parallel Index Scan using idx_timestamp on telemetry_events (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms
What would be the best way to optimize these queries?
And a couple of sub-questions: is my expectation of keeping this query under ~10 seconds reasonable? How fast can I make PostgreSQL for this kind of OLAP workload?