We have a table of scientific data that's sampled twice a day from roughly 10,000-12,000 sensors across the country. Each sensor pings some data to us, which is stored in this table.

A simplified version is below - sensor_rt_data:

id | BIGINT PK
sensor-name | STRING
location-id | INT FK
sensor-value | NUMERIC(0,2)
last-updated | TIMESTAMP_WITH_TIMEZONE

Unfortunately, there are duplicated samples throughout the day that I'm trying to remove. E.g.:

Simplified (last-updated | sensor-value | location-id):

2017-03-30 06:30 | 49.00 | 1
2017-03-30 06:30 | 37.00 | 2
2017-03-30 10:30 | 51.00 | 1
2017-03-30 10:30 | 35.00 | 2
2017-03-30 15:30 | 51.00 | 1
2017-03-30 15:30 | 35.00 | 2
2017-03-30 18:30 | 51.00 | 1
2017-03-30 20:30 | 42.00 | 1

I'm trying to cull the three consecutive 51s down to just one. I can remove duplicates with DISTINCT, but I'm not sure how to remove only the duplicates that occur in series, so that it looks like this:

2017-03-30 06:30 | 49.00 | 1
2017-03-30 06:30 | 37.00 | 2
2017-03-30 18:30 | 51.00 | 1
2017-03-30 15:30 | 35.00 | 2
2017-03-30 20:30 | 42.00 | 1

I also had a look at, and a play with, "Deleting Duplicates" on the wiki, but my query didn't seem to delete the series data.

Before you suggest it, we can't ignore the duplicates at the source (that would be lovely, I'm totally sensing that!) due to some legal kerfuffle that I'm not privy to.

Would SQL be able to handle that sort of deduping, or would I have to move that data to another table? We've had this running for 6 months and the table size is getting big and most of it is unnecessary ping data.

EDIT: For clarification, this is a big table with many records. I am trying to remove all rows that duplicate the previous "latest" reading (comparing only certain fields: location-id, sensor-value and last-updated), if that makes sense.

If this was done outside of SQL, I could load each row (ordered by date ASC) and store the "latest" reading in an array for each location-id; if a retrieved row has the same sensor-value as the last one stored for that location-id, I'd discard it.

At the end, I should have data that doesn't duplicate the sensor-value across time and only stores changes in the sensor values (which are what's relevant).
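The row-by-row approach described above could be sketched in Python roughly as follows. This is only an illustration, assuming the rows have already been fetched as (last_updated, sensor_value, location_id) tuples; the function name `drop_repeated_readings` is made up for this example. Note that it keeps the first row of each run of identical values, whereas the desired output listed earlier keeps the last one.

```python
from decimal import Decimal


def drop_repeated_readings(rows):
    """Keep a row only if its value differs from the last value seen for its location.

    rows: iterable of (last_updated, sensor_value, location_id) tuples.
    """
    latest = {}  # location_id -> most recent sensor_value seen
    kept = []
    # Timestamps in one fixed "YYYY-MM-DD HH:MM" format sort chronologically as strings.
    for last_updated, value, location_id in sorted(rows, key=lambda r: r[0]):
        if location_id in latest and latest[location_id] == value:
            continue  # same value as the previous reading for this location: discard
        latest[location_id] = value
        kept.append((last_updated, value, location_id))
    return kept


# Sample rows copied from the question.
sample = [
    ("2017-03-30 06:30", Decimal("49.00"), 1),
    ("2017-03-30 06:30", Decimal("37.00"), 2),
    ("2017-03-30 10:30", Decimal("51.00"), 1),
    ("2017-03-30 10:30", Decimal("35.00"), 2),
    ("2017-03-30 15:30", Decimal("51.00"), 1),
    ("2017-03-30 15:30", Decimal("35.00"), 2),
    ("2017-03-30 18:30", Decimal("51.00"), 1),
    ("2017-03-30 20:30", Decimal("42.00"), 1),
]

for row in drop_repeated_readings(sample):
    print(row)
```

On the sample this leaves five rows: the two 06:30 readings, the two 10:30 readings, and the 20:30 reading.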

EDIT

Thanks to the answer below, I've got it working on our dataset after some tweaking of the query. However, I'm noticing that some sensor readings are reduced to just two records:

2017-02-28 00:00:00 144
2017-02-27 00:00:00 139
2017-02-26 00:00:00 139
... 20 more at 139
2017-02-14 00:00:00 129
... 10 more at 129

turns into:

2017-02-28 00:00:00 144
2017-02-14 00:00:00 129

I was expecting a 139 row to make an appearance there as well. The example from the accepted answer works fine, though.

  • Does the sensordata need to be unique for one day, or do you want to remove duplicates across all location_ids? Commented Mar 30, 2017 at 6:15
  • Duplicates across all location ids. I might update my question to clarify that. Commented Mar 30, 2017 at 6:26
  • I updated my answer - I totally misunderstood this the first time Commented Mar 30, 2017 at 6:44
  • Wow, I'm so sorry, I tried replying to your answer earlier but it was removed. Thank you so much! I'm looking into it now. I may not have written the question out properly, so that's my fault; I was rushing to get out of work. Commented Mar 30, 2017 at 9:04

1 Answer

Something like:

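-- Delete every row whose value equals the next reading (by last_updated)
-- for the same location, so only the last row of each run is kept.
delete from sensordata s
using (
  select id,
         -- true when the following row for this location has the same value
         sensor_value = lead(sensor_value) over w as same_value_as_next
  from sensordata
  window w as (partition by location_id order by last_updated)
) x
where x.id = s.id
  and x.same_value_as_next
;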

However, since this goes over all rows, it is not going to be very efficient; I can't think of a better way right now.

Online example: http://rextester.com/SGPOB26281
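To see which rows the lead()-based condition marks for deletion, the same comparison can be mimicked in plain Python. This is only a sketch of the logic, not of how PostgreSQL executes the statement; the function name `rows_to_delete`, the ids 1 to 8, and the in-memory tuples are assumptions made for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def rows_to_delete(rows):
    """Return the ids of rows whose value equals the next row's value for the same location.

    rows: iterable of (id, last_updated, sensor_value, location_id) tuples.
    """
    by_location = defaultdict(list)
    for row in rows:
        by_location[row[3]].append(row)      # partition by location_id

    doomed = set()
    for group in by_location.values():
        group.sort(key=lambda r: r[1])       # order by last_updated
        for current, following in zip(group, group[1:]):
            if current[2] == following[2]:   # sensor_value = lead(sensor_value)
                doomed.add(current[0])
    return doomed


# Sample rows from the question, with hypothetical ids 1..8 added.
sample = [
    (1, "2017-03-30 06:30", Decimal("49.00"), 1),
    (2, "2017-03-30 06:30", Decimal("37.00"), 2),
    (3, "2017-03-30 10:30", Decimal("51.00"), 1),
    (4, "2017-03-30 10:30", Decimal("35.00"), 2),
    (5, "2017-03-30 15:30", Decimal("51.00"), 1),
    (6, "2017-03-30 15:30", Decimal("35.00"), 2),
    (7, "2017-03-30 18:30", Decimal("51.00"), 1),
    (8, "2017-03-30 20:30", Decimal("42.00"), 1),
]

print(sorted(rows_to_delete(sample)))
```

On the sample data this marks the 10:30 and 15:30 readings of 51.00 for location 1 and the 10:30 reading of 35.00 for location 2, which leaves the five rows listed as the desired result in the question.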

5 Comments

Thanks for the example. Hmmm, will this remove historical data?
@LisaAnna: it removes all but one duplicate row for the same location id, which is how I understood your question. If you want to limit that to a specific date range, include the appropriate where clause.
Ah, I will play around with your example. Thank you, will update soon.
I think she wants to remove it from the listing, not from the database, given what she said about "some legal kerfuffle". But I may be wrong. Just the select will solve it for her. +1
That is brilliant @a_horse_with_no_name, and precisely what I was trying to do in a very complicated SQL flow! With some tweaking I think we can run this over our dataset and rerun it outside of our normal hours!
