I need to delete certain records from my table that I consider "duplicates". They're not exact duplicates, since not every column value is the same. Rather, the logic is something like this:

If col_a and col_b have the same values across several rows, and the col_c values (timestamps) are within, say, 5 minutes of each other, then delete all of those rows except the one with the earliest timestamp.

Example Data:

id    col_a    col_b     col_c
1     foo      bar       2016-01-01 00:00:00
2     foo      bar       2016-01-01 00:00:12
3     foo      bar       2016-01-01 00:00:22
4     foo      bar       2016-01-05 00:00:00
5     apple    banana    2016-01-01 00:00:00
6     apple    banana    2016-01-05 00:00:00

In the above example, I want to delete id = 2 and id = 3. Is this possible to do in MySQL?

  • What if you have multiple records with the same col_a and col_b, where the time difference between the first and last records is, let's say, 10 minutes (outside the 5-minute tolerance), but there is less than 5 minutes between consecutive records? Do you delete all but the earliest record, or do you stop deleting once you are 5 minutes past the earliest one? Would it be acceptable to delete all duplicates within each 5-minute interval and preserve only the earliest timestamp from that interval? Commented Mar 18, 2016 at 18:26
  • Yes, I think so. "Legitimate" records are at least 2 hours apart, and often days apart. Due to a very odd bug in my app, more than the necessary records are getting inserted. It's not really causing any problems in the application, but I just want to clean up the table a bit. Commented Mar 18, 2016 at 18:37
  • So, as confirmation: if we have records (same col_a and col_b) with col_c times in a series that are 4 minutes apart, e.g. 06:15, 08:30, 08:34, 08:38, 08:42, etc., we'd keep the 06:15 and the 08:30, but delete the 08:34, 08:38, 08:42. That is, as long as there is another row (same col_a, col_b) within the previous five minutes, we should remove that record, even if that previous record is also going to be deleted. Commented Mar 18, 2016 at 20:03
  • I think you're right. There are over a million records in this particular table I need to clean, so it's hard to say for sure. I need to investigate. Commented Mar 18, 2016 at 20:51
  • I suggest you first write a SELECT statement that identifies the rows to be removed, and once you have that tested and verify that it's returning the rows you want, then convert that into a DELETE. Commented Mar 18, 2016 at 21:17
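
A preview query along the lines suggested in the last comment might look like the sketch below. It is only an illustration under assumptions: `tab` is a placeholder table name (the question never names the table), and "within 5 minutes" is read as strictly less than 5 minutes after some earlier row with the same col_a and col_b, which matches the chained rule described in the comments.

```sql
-- Preview only, nothing is deleted.
-- `tab` is a placeholder table name (assumption).
-- d = candidate row for deletion, t = an earlier row in the same group.
SELECT d.id, d.col_a, d.col_b, d.col_c
FROM tab AS d
WHERE EXISTS (
    SELECT 1
    FROM tab AS t
    WHERE t.col_a = d.col_a
      AND t.col_b = d.col_b
      AND d.col_c > t.col_c                      -- t is strictly earlier than d
      AND d.col_c < t.col_c + INTERVAL 5 MINUTE  -- and less than 5 minutes earlier
)
ORDER BY d.col_a, d.col_b, d.col_c;
```

Against the example data in the question this lists id = 2 and id = 3. Rows that share exactly the same timestamp are not matched against each other by this sketch, so exact-timestamp duplicates would need separate handling (for example, a tie-break on id).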

1 Answer

I think this could do the trick:

DELETE t1
FROM tab AS t1
JOIN tab AS t2
  ON t1.col_a = t2.col_a AND t1.col_b = t2.col_b
WHERE t1.col_c > t2.col_c
  AND t1.col_c < t2.col_c + INTERVAL 5 MINUTE;

Join the table to itself to pair up rows that share the same col_a and col_b. From those pairs, keep only the ones where t1 is later than t2 by less than 5 minutes; t1, the later row of each pair, is the one that gets deleted. Note the strict > rather than >=: with >=, every row would match itself and be deleted.


1 Comment

I think a three-argument DATE_DIFF(MINUTE, ...) in the style of SQL Server's DATEDIFF won't run in MySQL; MySQL's own DATEDIFF takes two arguments and returns whole days. The closest equivalent in MySQL is TIMESTAMPDIFF. Though in MySQL, we'd likely do the comparisons without the function, just using expressions like d.col_c < t.col_c + INTERVAL 5 MINUTE
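
To make the difference between the two MySQL forms mentioned in this comment concrete, here is a small sketch using literal timestamps taken from the question's example rows. The unit matters with TIMESTAMPDIFF: it returns a whole number of units, so a gap of a few seconds measured in MINUTE comes out as 0.

```sql
-- TIMESTAMPDIFF(unit, start, end) returns end - start in whole units.
SELECT TIMESTAMPDIFF(SECOND,
                     TIMESTAMP '2016-01-01 00:00:00',
                     TIMESTAMP '2016-01-01 00:00:12');  -- 12
SELECT TIMESTAMPDIFF(MINUTE,
                     TIMESTAMP '2016-01-01 00:00:00',
                     TIMESTAMP '2016-01-01 00:00:12');  -- 0 (truncated)

-- The same 5-minute test as a plain comparison with INTERVAL arithmetic.
SELECT TIMESTAMP '2016-01-01 00:00:12'
     < TIMESTAMP '2016-01-01 00:00:00' + INTERVAL 5 MINUTE;  -- 1 (true)
```

The plain comparison avoids the truncation issue altogether, and because col_c is not wrapped in a function on one side, it generally leaves the optimizer more room to use an index on that column, though whether it does depends on the table's indexes.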
