
I've got a SQL Server database with quite a few dupes in it. Removing the dupes manually is just not going to be fun, so I was wondering whether there is any sort of SQL programming or scripting I can do to automate it.

Below is my query that returns the ID and the Code of the duplicates.

select a.ID, a.Code
from Table1 a
inner join (
    select Code
    from Table1
    group by Code
    having count(Code) > 1
) x on x.Code = a.Code

I'll get a return like this, for example:

5163    51727
5164    51727
5165    51727
5166    51728
5167    51728
5168    51728

This snippet shows three rows for each Code (so a primary "good" record and two dupes). However, this isn't always the case: there can be up to [n] dupes, although 2-3 seems to be the norm.

I just want to somehow loop through this result set and delete everything but one record. THE RECORDS TO DELETE ARE ARBITRARY, as any of them can be "kept".

4 Answers


You can use ROW_NUMBER to drive your delete, i.e.:

CREATE TABLE #table1 (
    id INT,
    code INT
);

WITH cte AS
(
    SELECT a.id, a.code,
           ROW_NUMBER() OVER (PARTITION BY code ORDER BY id) AS rn
    FROM #table1 a
)
DELETE x
FROM #table1 x
JOIN cte ON x.id = cte.id
WHERE cte.rn > 1

But... if you are going to be doing a lot of deletes from a very large table, you might be better off selecting the rows you need into a temp table, truncating the original table, and re-inserting the rows you kept. That keeps the transaction log from getting hammered and your clustered index from getting fragmented, and it should be quicker too!
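That alternative could be sketched like this (table and column names follow the `#table1` example above; note that `TRUNCATE TABLE` fails if foreign keys reference the table, so this sketch assumes none do):

```sql
-- Keep one arbitrary row per code in a temp table,
-- then truncate and reload. A TRUNCATE plus re-insert
-- is typically far lighter on the log than a mass DELETE.
SELECT id, code
INTO #keep
FROM (
    SELECT id, code,
           ROW_NUMBER() OVER (PARTITION BY code ORDER BY id) AS rn
    FROM #table1
) t
WHERE rn = 1;

TRUNCATE TABLE #table1;  -- not allowed if FKs reference the table

INSERT INTO #table1 (id, code)
SELECT id, code
FROM #keep;

DROP TABLE #keep;
```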




It is actually very simple:

DELETE FROM Table1
WHERE ID NOT IN
         (SELECT MAX(ID)
          FROM Table1
          GROUP BY CODE)

2 Comments

This looks like it would work, but I would change the first line to SELECT ID, Code FROM Table1 and check that the results and count match what you're expecting from your existing query first :)
I agree and always do that, or even run SELECT * INTO #Table1Backup FROM Table1; beforehand.
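The dry run suggested in these comments could look like this (column names as in the question):

```sql
-- Preview exactly which rows the DELETE above would remove;
-- the count should match the duplicate query from the question.
SELECT ID, Code
FROM Table1
WHERE ID NOT IN
         (SELECT MAX(ID)
          FROM Table1
          GROUP BY Code);

-- Optional safety net: snapshot the table before deleting.
SELECT * INTO #Table1Backup FROM Table1;
```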

Self-join solution, with a performance test against the CTE approach.

create table codes(
    id int IDENTITY(1,1) NOT NULL,
    code int null,
    CONSTRAINT [PK_codes_id] PRIMARY KEY CLUSTERED (id ASC)
)

declare @counter int
set @counter = 1
while (@counter <= 1000000)
begin
    -- insert a random code between 0 and 999
    insert into codes(code) select ABS(CHECKSUM(NEWID()) % 1000)
    set @counter = @counter + 1
end
GO

set statistics time on;
    delete a
    from codes a
    left join (
        select MIN(id) as id
        from codes
        group by code
    ) b on a.id = b.id
    where b.id is null
set statistics time off;

--set statistics time on;
--  WITH cte AS 
--  (select a.id, a.code, ROW_NUMBER() OVER(PARTITION by code ORDER BY id) AS rn
--  from codes a
--  )
--  delete x
--  FROM codes x
--  JOIN cte ON x.id = cte.id
--  WHERE cte.rn > 1
--set statistics time off;

Performance test results:

With join:

 SQL Server Execution Times:
   CPU time = 3198 ms,  elapsed time = 3200 ms.

(999000 row(s) affected)

With CTE:

 SQL Server Execution Times:
   CPU time = 4197 ms,  elapsed time = 4229 ms.

(999000 row(s) affected)

2 Comments

Why would performance be important for a problem like this? Sounds like a one-off done by the admin.
It's important to know that CTEs are slow. Other people might be looking at this question in the future, and they might have different problems to solve.

It's basically done like this:

WITH CTE_Dup AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SalesOrderno, ItemNo
                              ORDER BY SalesOrderno, ItemNo) AS ROW_NO
    FROM dbo.SalesOrderDetails
)
DELETE FROM CTE_Dup WHERE ROW_NO > 1;

Note: the PARTITION BY must include all the fields that define a duplicate.

Here is another example:

CREATE TABLE #Table (C1 INT,C2 VARCHAR(10))

INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (2,'Oracle')

SELECT * FROM #Table

;WITH Delete_Duplicate_Row_cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY C1, C2 ORDER BY C1, C2) AS ROW_NUM, *
         FROM #Table)
DELETE FROM Delete_Duplicate_Row_cte
WHERE ROW_NUM > 1

SELECT * FROM #Table

