We have a table of scientific data that's sampled twice a day from about 10,000-12,000 sensors across the country. Each sensor pings some data to us, which is inserted into this table.
A simplified version is below - sensor_rt_data:
id | BIGINT PK
sensor-name | STRING
location-id | INT FK
sensor-value | NUMERIC(0,2)
last-updated | TIMESTAMP_WITH_TIMEZONE
Unfortunately, there are duplicated samples throughout the day that I'm trying to remove, e.g. (simplified; location-id is the last column):
2017-03-30 06:30 | 49.00 | 1
2017-03-30 06:30 | 37.00 | 2
2017-03-30 10:30 | 51.00 | 1
2017-03-30 10:30 | 35.00 | 2
2017-03-30 15:30 | 51.00 | 1
2017-03-30 15:30 | 35.00 | 2
2017-03-30 18:30 | 51.00 | 1
2017-03-30 20:30 | 42.00 | 1
I'm trying to cull the three 51s down to just one. I can remove exact duplicates with DISTINCT, but I'm not sure how to go about removing only duplicates in series, so it looks like this:
2017-03-30 06:30 | 49.00 | 1
2017-03-30 06:30 | 37.00 | 2
2017-03-30 18:30 | 51.00 | 1
2017-03-30 15:30 | 35.00 | 2
2017-03-30 20:30 | 42.00 | 1
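One way this might be done is with a LEAD() window function: keep a row only when the next sensor-value for the same location differs (or there is no next row), which drops every reading in a run except the last, matching the desired output above. This is only a sketch, not a definitive fix; it assumes a database with window-function support, and it's demonstrated here with Python's sqlite3 using underscored column names in place of the hyphenated ones:

```python
import sqlite3

# In-memory stand-in for the simplified sensor_rt_data table.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sensor_rt_data (
    id INTEGER PRIMARY KEY,
    location_id INTEGER,
    sensor_value REAL,
    last_updated TEXT)""")
rows = [
    ("2017-03-30 06:30", 49.00, 1),
    ("2017-03-30 06:30", 37.00, 2),
    ("2017-03-30 10:30", 51.00, 1),
    ("2017-03-30 10:30", 35.00, 2),
    ("2017-03-30 15:30", 51.00, 1),
    ("2017-03-30 15:30", 35.00, 2),
    ("2017-03-30 18:30", 51.00, 1),
    ("2017-03-30 20:30", 42.00, 1),
]
con.executemany(
    "INSERT INTO sensor_rt_data (last_updated, sensor_value, location_id)"
    " VALUES (?, ?, ?)", rows)

# LEAD() peeks at the next value for the same location in time order;
# a row survives only if the run of equal values ends at that row.
query = """
SELECT last_updated, sensor_value, location_id
FROM (
    SELECT last_updated, sensor_value, location_id,
           LEAD(sensor_value) OVER (
               PARTITION BY location_id
               ORDER BY last_updated, id) AS next_value
    FROM sensor_rt_data
)
WHERE next_value IS NULL OR sensor_value <> next_value
ORDER BY last_updated, location_id
"""
results = [tuple(row) for row in con.execute(query)]
for row in results:
    print(row)
```

Swapping LEAD() for LAG() (and comparing against the previous value instead) would keep the first row of each run rather than the last, i.e. the row where the value actually changed.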
I also had a look at and a play with Deleting Duplicates on the wiki, but my query didn't seem to delete the series data.
Before you suggest it, we can't ignore the duplicates at the source (that would be lovely, I'm totally sensing that!) due to some legal kerfuffle that I'm not privy to.
Would SQL be able to handle that sort of deduping, or would I have to move the data to another table? We've had this running for six months, the table size is getting big, and most of it is unnecessary ping data.
EDIT: For clarification, this is a big table of many records. I was trying to remove all duplicates of the previous "latest" reading, comparing only certain fields (location-id, sensor-value and last-updated), if that makes sense.
If this were done outside of SQL, I could load each row (ordered by date ascending) and store the "latest" reading for each location-id in an array; if a retrieved row has the same sensor-value as the last one for that location-id, I'd discard it.
At the end, I should have data that doesn't duplicate the sensor-value across time and only stores changes in the sensor values (which are what's relevant).
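The procedural approach described above can be sketched in Python (a hand-rolled illustration of the idea, not the accepted answer's query; the `rows` tuples stand in for rows fetched ordered by last-updated ascending):

```python
def dedupe_in_series(rows):
    """Keep only rows whose sensor_value differs from the last seen
    value for the same location_id. Rows must arrive ordered by
    last_updated ascending; keeps the first row of each run."""
    latest = {}  # location_id -> last seen sensor_value
    kept = []
    for last_updated, sensor_value, location_id in rows:
        # First reading for a location, or a changed value: keep it.
        if location_id not in latest or latest[location_id] != sensor_value:
            kept.append((last_updated, sensor_value, location_id))
        latest[location_id] = sensor_value
    return kept

rows = [
    ("2017-03-30 06:30", 49.00, 1),
    ("2017-03-30 06:30", 37.00, 2),
    ("2017-03-30 10:30", 51.00, 1),
    ("2017-03-30 10:30", 35.00, 2),
    ("2017-03-30 15:30", 51.00, 1),
    ("2017-03-30 15:30", 35.00, 2),
    ("2017-03-30 18:30", 51.00, 1),
    ("2017-03-30 20:30", 42.00, 1),
]
for row in dedupe_in_series(rows):
    print(row)
```

Note this keeps the first reading of each run (the moment the value changed), whereas the sample output earlier keeps the last; either convention removes the repeated 51s.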
EDIT
Thanks to the answer below, I've got it going on our dataset after some tweaking of the query. However, I'm noticing that these sensor readings are down to just two records:
2017-02-28 00:00:00 144
2017-02-27 00:00:00 139
2017-02-26 00:00:00 139
... 20 more at 139
2017-02-14 00:00:00 129
... 10 more at 129
turns into:
2017-02-28 00:00:00 144
2017-02-14 00:00:00 129
I'm expecting the 139 to make an appearance there. The example from the accepted answer works fine, though.