
I've been stumped trying to optimize this query and was hoping some of you database wizards might have some insight. Here is the setup.

Using TimescaleDB, I have a wide table containing sensor data that looks like this:

time                   sensor_id  wind_speed  wind_direction
'2023-12-18 12:15:00'  '1'        NULL        176
'2023-12-18 12:13:00'  '1'        4           177
'2023-12-18 12:11:00'  '1'        3           NULL
'2023-12-18 12:09:00'  '1'        8           179
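
For context, the table was created roughly like this (column types are illustrative, not my actual DDL):

-- Sketch of the schema; types are assumptions based on the sample data
CREATE TABLE sensor_data (
    "time"         timestamptz NOT NULL,
    sensor_id      text        NOT NULL,
    wind_speed     numeric,
    wind_direction numeric
);
-- TimescaleDB hypertable, chunked on "time"
SELECT create_hypertable('sensor_data', 'time');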

I want to write a query that gives me the most recent non-null value for a set of columns, filtered on sensor_id. For the above data (filtering on sensor_id 1), the query should return:

wind_speed  wind_direction
4           176

My current query looks like this (querying sensor_ids in batches of 10):

SELECT
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '1' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction,

    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_two,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '2' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_two,
    .
    .
    .
    (SELECT wind_speed FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_speed IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_speed_ten,
    (SELECT wind_direction FROM sensor_data WHERE sensor_id = '10' AND "time" > now()-'7 days'::interval AND wind_direction IS NOT NULL ORDER BY "time" DESC LIMIT 1) as wind_direction_ten;

The table I am querying has 1,000 unique sensor_ids, all of which report data at a 2-minute interval. That's 720,000 rows per day, so we are talking hundreds of millions of rows.

I've created an index on (sensor_id, time DESC) to further optimize the query. With the index, this query takes roughly 400ms of planning time and 50ms of execution time.
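
That is, something like this (the index name is illustrative):

CREATE INDEX sensor_data_sensor_id_time_idx
    ON sensor_data (sensor_id, "time" DESC);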

How can I write the query differently (or add indexes) to achieve optimal planning and execution time?

6 Comments
  • What is the percentage of null values? Commented Dec 19, 2023 at 0:55
  • timescale.com/blog/… Commented Dec 19, 2023 at 1:02
  • Information as instructed here would be instrumental. Commented Dec 19, 2023 at 1:33
  • Do you want recent values for 10 given sensors or for all? Is there a table sensor with one row for every relevant sensor_id? How often do you query? Are rows immutable once written (and never deleted)? Commented Dec 19, 2023 at 1:46
  • @ErwinBrandstetter Sounds good! I'll follow those specs next time. In regards to your second comment, it looks like you don't need this info anymore, as you've answered the question (splendidly, if I might add). Anyway, I'll still answer the questions in case it is helpful for others. -- The former -- There is not, but I could make one -- At the moment, a maximum of 100 queries every 15-ish minutes (with the queries happening at around the same time, when made) -- Yes -- Commented Dec 19, 2023 at 19:37

2 Answers


Unfortunately, Postgres does not (yet, as of pg 16) implement IGNORE NULLS for window functions, which would allow a simple call of first_value() for each value column.
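
If Postgres did implement it, the whole query could be as simple as this sketch (IGNORE NULLS is standard SQL, supported by e.g. Oracle, but rejected by Postgres):

-- Hypothetical: does NOT run on Postgres (as of pg 16)
SELECT DISTINCT ON (sensor_id)
       sensor_id
     , first_value(wind_speed     IGNORE NULLS) OVER w AS wind_speed
     , first_value(wind_direction IGNORE NULLS) OVER w AS wind_direction
FROM   sensor_data
WHERE  ts > LOCALTIMESTAMP - interval '7 days'
WINDOW w AS (PARTITION BY sensor_id ORDER BY ts DESC);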

Solutions

fiddle

There are various shorter and possibly (much) faster options.
You should at least have a (partial) index on (ts). Possibly on (sensor_id, ts). Or more. See below. All depending on undisclosed details.

I find the name "time" for a timestamp column misleading. Using "ts" instead.

first_value() + DISTINCT ON

A shorter drop-in replacement.

SELECT DISTINCT ON (sensor_id)
       sensor_id
     , first_value(wind_speed    ) OVER (w ORDER BY wind_speed     IS NULL, ts DESC) AS wind_speed
     , first_value(wind_direction) OVER (w ORDER BY wind_direction IS NULL, ts DESC) AS wind_direction
--   , ... more?
FROM   sensor_data
WHERE  ts > LOCALTIMESTAMP - interval '7 days'
WINDOW w AS (PARTITION BY sensor_id);


count() window function in subquery + filtered aggregate in main

SELECT sensor_id
     , min(wind_speed)     FILTER (WHERE ws_ct = 1) AS wind_speed
     , min(wind_direction) FILTER (WHERE wd_ct = 1) AS wind_direction
--   , ... more?
FROM  (
   SELECT *
        , count(wind_speed)     OVER w AS ws_ct
        , count(wind_direction) OVER w AS wd_ct
   --   ,  ... more?
   FROM   sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   WINDOW w AS (PARTITION BY sensor_id ORDER BY ts DESC)
   ) sub
GROUP  BY sensor_id;

The running count(col) only increments at non-null values, so rows with a count of 1 span from the most recent non-null value down to (but not including) the second most recent one; within that span only the first row carries a non-null value, which the filtered min() then picks out.


Simpler based on "sensor" table

If you also have a table "sensor" with one row per relevant sensor_id (like you probably should), it gets simpler:

SELECT sensor_id
    , (SELECT wind_speed     FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_speed     IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_speed
    , (SELECT wind_direction FROM sensor_data WHERE sensor_id = s.sensor_id AND ts > t.ts_min AND wind_direction IS NOT NULL ORDER BY ts DESC LIMIT 1) AS wind_direction
--  , ... more?
FROM   sensor s
    , (SELECT LOCALTIMESTAMP - interval '7 days') t(ts_min)
;
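
If you don't have such a table yet, a minimal sketch to derive one from the existing data (name and structure are assumptions):

-- One row per relevant sensor_id
CREATE TABLE sensor AS
SELECT DISTINCT sensor_id
FROM   sensor_data;

ALTER TABLE sensor ADD PRIMARY KEY (sensor_id);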

The last query (like your verbose original) can use customized indexes. Ideally partial indexes, since there are many rows per sensor, few value columns, many null values, and many outdated rows.

CREATE INDEX sensor_data_wind_speed_idx     ON sensor_data (sensor_id, ts DESC, wind_speed)
WHERE  wind_speed IS NOT NULL
AND    ts > '2023-12-12 00:00';  -- constant!

CREATE INDEX sensor_data_wind_direction_idx ON sensor_data (sensor_id, ts DESC, wind_direction)
WHERE  wind_direction IS NOT NULL
AND    ts > '2023-12-12 00:00';  -- constant!

Use a constant that's one week in the past at creation time. The index grows in size over time, but stays applicable. Recreate indexes with later cut-off from time to time to keep the size at bay. (Not sure if the timestamp bound pays for your hypertables, though. Plain indexes may be good enough. I had plain Postgres in mind.)
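
In plain Postgres, recreating with a later cut-off could look like this sketch (names as above; CREATE INDEX CONCURRENTLY avoids blocking writes, but may not carry over to hypertables):

-- Rebuild with a newer constant cut-off, then swap in the new index
CREATE INDEX CONCURRENTLY sensor_data_wind_speed_idx_new
    ON sensor_data (sensor_id, ts DESC, wind_speed)
WHERE  wind_speed IS NOT NULL
AND    ts > '2024-01-12 00:00';  -- newer constant

DROP INDEX CONCURRENTLY sensor_data_wind_speed_idx;
ALTER INDEX sensor_data_wind_speed_idx_new RENAME TO sensor_data_wind_speed_idx;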

Then run the same query, but with a constant timestamp:

SELECT ...
FROM   sensor s
    , (SELECT timestamp '2023-12-12 03:47:16') t(ts_min)  -- MUST be a constant to use partial index!
;

Sorted subquery + first() aggregate function

If index support is not an option or not efficient, the most convenient query uses the aggregate function first() - probably fastest, too, if you use the C version from the additional module first_last_agg.

Required once per DB:

CREATE EXTENSION first_last_agg;

Then the query itself:

SELECT sensor_id
     , first(wind_speed    ) FILTER (WHERE wind_speed IS NOT NULL)     AS wind_speed
     , first(wind_direction) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
--   , ... more?
FROM   (
   SELECT * FROM sensor_data
   WHERE  ts > LOCALTIMESTAMP - interval '7 days'
   ORDER  BY sensor_id, ts DESC
   ) s
GROUP  BY 1;

5 Comments

  • My recollection of timescaledb is that it automatically partitions and indexes the data anyway?
  • @MatBailie: Right, "hypertables" are partitioned into time-based chunks anyway. I had plain Postgres in mind. That said, the right choice of additional partitioning columns still matters in TimescaleDB.
  • @gandalf: Which one did you end up using / which was the fastest for you?
  • @ErwinBrandstetter I went with Sorted subquery + first() aggregate function. The resulting planning/execution time went from 400ms/50ms to 5ms/200ms. This is with an index on (sensor_id, time DESC). This speedup is significant, thanks for the help.
  • @gandalf: It only does a single sort. With few nulls, like you commented, this is hard to beat - except by a smarter pre-sort. Note the adapted ORDER BY sensor_id, ts DESC above. Should be a bit faster, generally - and significantly faster for your given index in combination with a small selection of sensor_ids like in the question. While processing all sensors at once, an index on just (ts) and ORDER BY ts DESC might be better. I am not too sure about specifics of TimescaleDB.

Expanding on the solution you chose from the amazing list given by @ErwinBrandstetter:

Because you're using TimescaleDB, you don't actually need the first_last_agg extension: TimescaleDB ships its own (slightly different) first()/last() aggregates.

That query can actually be simplified down to:

SELECT sensor_id
     , last(wind_speed, ts)     FILTER (WHERE wind_speed     IS NOT NULL) AS wind_speed
     , last(wind_direction, ts) FILTER (WHERE wind_direction IS NOT NULL) AS wind_direction
FROM   sensor_data
WHERE  ts > LOCALTIMESTAMP - interval '7 days'
GROUP  BY 1;

Based on your feedback that your original planning time was 400ms, I do wonder how many chunks your Timescale hypertable has - I think you could probably optimize here!
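
You can check with something like this (assuming TimescaleDB 2.x):

-- Count the chunks backing the hypertable
SELECT count(*)
FROM   timescaledb_information.chunks
WHERE  hypertable_name = 'sensor_data';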

Another avenue for optimization is to compress this data. When I did a test, compression dropped the storage required for my data by 8x and improved my query speed (for the query above) by 3x.

I compressed, segmenting by sensor_id and ordering by ts DESC, wind_speed, wind_direction.
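
In TimescaleDB that looks roughly like this (the policy interval is an assumption; tune it to how long chunks stay hot):

-- Enable native compression with the segment/order settings above
ALTER TABLE sensor_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id',
    timescaledb.compress_orderby   = 'ts DESC, wind_speed, wind_direction'
);

-- Automatically compress chunks older than 7 days
SELECT add_compression_policy('sensor_data', INTERVAL '7 days');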

1 Comment

  • Ah, thanks for this! Your query halved my planning and execution time compared to the Sorted subquery + first() aggregate function approach. I also compressed the data, which saved me a ton on storage.
