So, I have a task to produce a design for storing data retrieved from sensors: temperature, pressure, roll, etc.
The first prototype was quite simple. I created a TimeSync table with 2 columns: an auto-increment ID and Time. Then for each value I created a table of 2 columns: a foreign key to the TimeSync ID, and the value as a float.
Reading the data back was quite easy: I would filter the TimeSync table by the date range I'm interested in and join the tables for the values I want to read back. The issue was disk space usage. We are quite limited on disk space, and storing 12 parameters for a year at 1 Hz logging used ~100 GB in SQLite.
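For reference, the prototype layout looked roughly like this (SQLite DDL; the table and column names here are illustrative, one value table per parameter):

```sql
CREATE TABLE TimeSync (
    ID   INTEGER PRIMARY KEY AUTOINCREMENT,
    Time TEXT NOT NULL
);

-- One of these per logged parameter, all referencing the same TimeSync row:
CREATE TABLE Temperature (
    TimeSyncID INTEGER NOT NULL REFERENCES TimeSync(ID),
    Value      REAL NOT NULL
);
```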
So we made a decision to move to PostgreSQL and apply slightly more complicated logic. The thing is, there is quite a lot of duplicated data; take temperature, for example. It doesn't change every second, it might change once a minute, so there is no need to store it so often. The ideal solution would be to store the first value received, then on each subsequent value check whether it has changed, and only write to the database if it has.
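The write-on-change logic happens on the application side before anything hits the database. A minimal sketch in Python (the function name and the optional tolerance parameter are my own, not part of any existing code):

```python
def changed_only(samples, tolerance=0.0):
    """Yield (time, value) pairs, keeping a sample only when the value
    differs from the last *stored* value by more than `tolerance`."""
    last = None
    for t, v in samples:
        if last is None or abs(v - last) > tolerance:
            yield (t, v)
            last = v

# 1 Hz readings; the temperature only actually changes three times:
readings = [(0, 18.9), (1, 18.9), (2, 18.9), (3, 17.9), (4, 17.9), (5, 19.9)]
print(list(changed_only(readings)))  # → [(0, 18.9), (3, 17.9), (5, 19.9)]
```

A nonzero tolerance would additionally suppress writes for noise-level jitter, at the cost of some accuracy on readback.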
So that means values can be stored at different effective frequencies; say roll changes every 1 s while temperature changes every 60 s. Now my issue is how to combine that data into a single query.
My design so far is to store when each device was online and when it went offline. This will provide clues later for proper filtering.
Next, each value is stored in its own table consisting of 2 columns: Time and the value itself.
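In DDL the layout would look roughly like this (a sketch; apart from `data.device_2_p_3`, which appears in my query below, all names are placeholders):

```sql
CREATE TABLE activity (
    ID      serial PRIMARY KEY,
    DevID   integer NOT NULL,
    Online  timestamp NOT NULL,
    Offline timestamp
);

-- One of these per device/parameter pair:
CREATE TABLE data.device_2_p_3 (
    "timestamp" timestamp NOT NULL,
    value       real NOT NULL
);
```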
So, some examples. The activity table looks like:
ID  DevID  Online               Offline
1   1      2017-01-16 16:13:46  2017-01-16 16:24:38
13  1      2017-01-16 16:32:51  2017-01-16 16:42:16
and the data tables. Data Table 1:
Time Value
2017-01-16 16:13:59 18.9
2017-01-16 16:14:20 17.9
2017-01-16 16:15:08 19.9
Data Table 2:
Time Value
2017-01-16 16:13:57 348
2017-01-16 16:14:05 350
2017-01-16 16:14:17 353
I'm using generate_series from PostgreSQL, and it looks like what I need:
select *
-- Generate a series for the specified range at the specified interval
from generate_series(
        '2017-01-16 16:10'::timestamp,
        '2017-01-16 19:00'::timestamp,
        '1 second'::interval) as date
-- Join one data table
left outer join (
    select date_trunc('second', d."timestamp") as val1time,
           avg(d.value) as val1avg
    from data.device_2_p_3 d
    group by val1time
) results on date = results.val1time
-- Join the other data table
left outer join (
    select date_trunc('second', d."timestamp") as val2time,
           avg(d.value) as val2avg
    from data.device_1_p_1 d
    group by val2time
) results2 on date = results2.val2time
order by date asc;
And the readback is:
date     val1time  val1avg  val2time  val2avg
14:00.0  null      null     null      null
14:01.0  null      null     14:01.0   349
14:02.0  14:02.0   18.8     null      null
14:03.0  null      null     14:03.0   349.5
14:04.0  null      null     null      null
The issue is that I'm not able to interpolate, or carry forward the previous value, for the rows where the value is null but the device was active at that point. Any clues on how to solve this, or suggestions for improving the design, would be highly appreciated.