Removing duplicate entries in sequence in a Postgres table

Question

We have a table of scientific data sampled twice a day from about 10,000-12,000 sensors across the country. Each sensor pings some data to us, which is stored in this table.

A simplified version is below - sensor_rt_data:

id | BIGINT PK
sensor-name | STRING
location-id | INT FK
sensor-value | NUMERIC(0,2)
last-updated | TIMESTAMP_WITH_TIMEZONE

Unfortunately, there are duplicated samples throughout the day that I'm trying to remove. For example:

Simplified (location-id is the last column):

2017-03-30 06:30 | 49.00 | 1
2017-03-30 06:30 | 37.00 | 2
2017-03-30 10:30 | 51.00 | 1
2017-03-30 10:30 | 35.00 | 2
2017-03-30 15:30 | 51.00 | 1
2017-03-30 15:30 | 35.00 | 2
2017-03-30 18:30 | 51.00 | 1
2017-03-30 20:30 | 42.00 | 1

I'm trying to cull the three 51s down to just one. I can remove duplicates with DISTINCT, but I'm not sure how to remove only the duplicates that occur in series, so that the result looks like this:

2017-03-30 06:30 | 49.00 | 1
2017-03-30 06:30 | 37.00 | 2
2017-03-30 18:30 | 51.00 | 1
2017-03-30 15:30 | 35.00 | 2
2017-03-30 20:30 | 42.00 | 1

I also had a look at, and a play with, Deleting Duplicates on the wiki, but my query didn't seem to delete the series data.

Before you suggest it, we can't ignore the duplicates at the source (that would be lovely, I'm totally sensing that!) due to some legal kerfuffle that I'm not privy to.

Would SQL be able to handle that sort of deduping, or would I have to move the data to another table? We've had this running for 6 months; the table is getting big and most of it is unnecessary ping data.

EDIT: For clarification, this is a big table with many records. I'm trying to remove every row that duplicates the previous "latest" row, checking only certain fields (location-id, sensor-value and last-updated), if that makes sense.

If this were done outside of SQL, I could load each row (ordered by date ASC) and store the "latest" reading for each location-id in an array. If the retrieved row has the same sensor-value as the last one for that location-id, I'd discard it.

At the end, I should have data that doesn't duplicate the sensor-value across time and only stores changes in the sensor values (which are what's relevant).
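The row-by-row approach described above could be sketched in Python roughly as follows. This is only an illustration, not something from the original post: the `Reading` type and the `keep_changes` function are made-up names, and the sample rows are the simplified ones from the question. Note that this version keeps the first row of each run of identical values, whereas the desired output shown earlier keeps a later row of each run.

```python
from typing import Dict, Iterable, List, NamedTuple


class Reading(NamedTuple):
    """Hypothetical row shape: (last-updated, sensor-value, location-id)."""
    last_updated: str   # timestamp text in one fixed format, so it sorts correctly
    sensor_value: float
    location_id: int


def keep_changes(rows: Iterable[Reading]) -> List[Reading]:
    """Keep a row only if its value differs from the latest value seen for its location."""
    latest: Dict[int, float] = {}   # location_id -> most recent sensor_value
    kept: List[Reading] = []
    # sorted() is stable, so rows with equal timestamps keep their input order.
    for row in sorted(rows, key=lambda r: r.last_updated):
        if latest.get(row.location_id) != row.sensor_value:
            kept.append(row)
            latest[row.location_id] = row.sensor_value
        # otherwise the row repeats the latest value for that location: discard it
    return kept


# Simplified sample rows from the question.
SAMPLE = [
    Reading("2017-03-30 06:30", 49.00, 1),
    Reading("2017-03-30 06:30", 37.00, 2),
    Reading("2017-03-30 10:30", 51.00, 1),
    Reading("2017-03-30 10:30", 35.00, 2),
    Reading("2017-03-30 15:30", 51.00, 1),
    Reading("2017-03-30 15:30", 35.00, 2),
    Reading("2017-03-30 18:30", 51.00, 1),
    Reading("2017-03-30 20:30", 42.00, 1),
]

if __name__ == "__main__":
    for reading in keep_changes(SAMPLE):
        print(reading.last_updated, reading.sensor_value, reading.location_id)
```

On the sample rows this leaves five readings, one per change of value per location.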

EDIT

Thanks to the answer below, I've got it working, however...

After some tweaking of the query, I've got it running on our dataset. However, I'm noticing that these sensor readings are reduced to just two records:

2017-02-28 00:00:00 144
2017-02-27 00:00:00 139
2017-02-26 00:00:00 139
.. 20 more at 139
2017-02-14 00:00:00 129
...10 more at 129

turns into:

2017-02-28 00:00:00 144
2017-02-14 00:00:00 129

I was expecting the 139 to make an appearance there. The example from the accepted answer works fine, though.

Does the sensordata need to be unique for one day, or do you want to remove duplicates across all location_ids?
– user330315, Commented Mar 30, 2017 at 6:15
Duplicates across all location ids. I might update my question to clarify that.
– Lisa Anna, Commented Mar 30, 2017 at 6:26
I updated my answer - I totally misunderstood this the first time.
– user330315, Commented Mar 30, 2017 at 6:44
Wow, I'm so sorry, I tried replying to your answer earlier but it was removed. Thank you so much! I'm looking into it now. I may not have written it out properly, so my fault. I was rushing to get out of work.
– Lisa Anna, Commented Mar 30, 2017 at 9:04

score 1 · Accepted Answer · 2017-03-30 06:43:46Z


Something like:

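-- Delete every row whose value equals the next row's value for the same location,
-- so only the last row of each run of identical values survives.
delete from sensordata s
using (
  select id,
         -- true when the following row (by last_updated) has the same value
         sensor_value = lead(sensor_value) over w as same_value_as_next
  from sensordata
  window w as (partition by location_id order by last_updated)
) x
where x.id = s.id
  and x.same_value_as_next
;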

However, since this goes over all rows, it is not going to be very efficient, but I can't think of a better way right now.

Online example: http://rextester.com/SGPOB26281
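Before running a delete like this on a large table, it may be worth previewing which rows would survive. The sketch below is only an assumption-laden illustration: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (it needs an SQLite build with window functions, 3.25 or later), a made-up `preview_kept_rows` helper, and the simplified sample rows from the question. The SELECT applies the same lead() comparison but only lists the rows that would be kept.

```python
import sqlite3

# Simplified sample rows from the question: (last_updated, sensor_value, location_id).
ROWS = [
    ("2017-03-30 06:30", 49.00, 1),
    ("2017-03-30 06:30", 37.00, 2),
    ("2017-03-30 10:30", 51.00, 1),
    ("2017-03-30 10:30", 35.00, 2),
    ("2017-03-30 15:30", 51.00, 1),
    ("2017-03-30 15:30", 35.00, 2),
    ("2017-03-30 18:30", 51.00, 1),
    ("2017-03-30 20:30", 42.00, 1),
]

# A row is kept when it is the last one for its location (next_value is null)
# or when the next row for that location carries a different value.
KEEP_QUERY = """
select last_updated, sensor_value, location_id
from (
  select last_updated, sensor_value, location_id,
         lead(sensor_value) over w as next_value
  from sensordata
  window w as (partition by location_id order by last_updated)
) x
where next_value is null or next_value <> sensor_value
order by last_updated, location_id
"""


def preview_kept_rows(rows):
    """Load rows into an in-memory table and return the ones the delete would keep."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table sensordata ("
            "id integer primary key, last_updated text, "
            "sensor_value real, location_id integer)"
        )
        conn.executemany(
            "insert into sensordata (last_updated, sensor_value, location_id) "
            "values (?, ?, ?)",
            rows,
        )
        return conn.execute(KEEP_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in preview_kept_rows(ROWS):
        print(row)
```

On the sample rows this keeps the 18:30 reading of 51 and the 15:30 reading of 35, i.e. the last row of each run, which matches the desired output in the question.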

edited Mar 30, 2017 at 6:43

answered Mar 30, 2017 at 6:07

user330315


5 Comments

Lisa Anna Over a year ago

Thanks for the example, hmmm will this remove historical data?

user330315 Over a year ago

@LisaAnna: it removes all but one of the duplicate rows for the same location id, which is how I understood your question. If you want to limit that to a specific date range, include the appropriate where clause.
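One possible way to limit the delete to a date range is sketched below. It is an assumption, not the answer's exact statement: SQLite (via Python's sqlite3, version 3.25 or later) stands in for PostgreSQL, the delete is rewritten with `id in (...)` instead of `using`, and `make_demo_db`, `delete_duplicates_in_range` and the parameter names are made up for illustration. The range filter sits inside the subquery, so rows outside the range are neither deleted nor compared against.

```python
import sqlite3

# Variant of the delete restricted to a half-open time range [range_start, range_end).
DELETE_IN_RANGE = """
delete from sensordata
where id in (
  select id
  from (
    select id,
           sensor_value = lead(sensor_value) over w as same_value_as_next
    from sensordata
    where last_updated >= :range_start and last_updated < :range_end
    window w as (partition by location_id order by last_updated)
  ) x
  where same_value_as_next
)
"""


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table holding the simplified sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table sensordata ("
        "id integer primary key, last_updated text, "
        "sensor_value real, location_id integer)"
    )
    conn.executemany(
        "insert into sensordata (last_updated, sensor_value, location_id) "
        "values (?, ?, ?)",
        [
            ("2017-03-30 06:30", 49.00, 1),
            ("2017-03-30 06:30", 37.00, 2),
            ("2017-03-30 10:30", 51.00, 1),
            ("2017-03-30 10:30", 35.00, 2),
            ("2017-03-30 15:30", 51.00, 1),
            ("2017-03-30 15:30", 35.00, 2),
            ("2017-03-30 18:30", 51.00, 1),
            ("2017-03-30 20:30", 42.00, 1),
        ],
    )
    return conn


def delete_duplicates_in_range(conn, range_start: str, range_end: str) -> int:
    """Delete in-range rows whose next in-range row has the same value; return the count."""
    cur = conn.execute(
        DELETE_IN_RANGE, {"range_start": range_start, "range_end": range_end}
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    db = make_demo_db()
    print(delete_duplicates_in_range(db, "2017-03-30 10:00", "2017-03-30 16:00"))
```

One consequence of filtering this way: the last in-range row of a run is kept even if the first row after the range repeats its value, so duplicates that straddle the range boundary survive.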

Lisa Anna Over a year ago

Ah, I will play around with your example. Thank you, will update soon.

Jorge Campos Over a year ago

I think she wants to remove it from the listing, not from the database, given what she said about the "legal kerfuffle". But I may be wrong. Just the select will solve it for her. +1

Lisa Anna Over a year ago

That is brilliant @a_horse_with_no_name, and precisely what I was trying to do in a very complicated SQL flow! With some tweaking I think we can run this over our dataset and rerun it outside of our normal hours!
