
I've got a SQL Server database with quite a few dupes in it. Removing the dupes manually is just not going to be fun, so I was wondering whether there is any sort of SQL programming or scripting I can do to automate it.

Below is my query that returns the ID and the Code of the duplicates.

select a.ID, a.Code
from Table1 a
inner join (
    select Code
    from Table1
    group by Code
    having count(Code) > 1
) x on x.Code = a.Code

I'll get a return like this, for example:

5163    51727
5164    51727
5165    51727
5166    51728
5167    51728
5168    51728

This snippet shows three rows for each Code (so a primary "good" record and two dupes). However, this isn't always the case: there can be up to [n] dupes, although 2-3 seems to be the norm.

I just want to somehow loop through this result set and delete everything but one record. THE RECORDS TO DELETE ARE ARBITRARY, as any of them can be "kept".

4 Answers


You can use ROW_NUMBER to drive your delete, i.e.:

CREATE TABLE #table1 (
    id INT,
    code INT
);

WITH cte AS
(
    SELECT a.id, a.code,
           ROW_NUMBER() OVER (PARTITION BY code ORDER BY id) AS rn
    FROM #table1 a
)
DELETE x
FROM #table1 x
JOIN cte ON x.id = cte.id
WHERE cte.rn > 1

But... if you are going to be doing a lot of deletes from a very large table, you might be better off selecting the rows you need into a temp table, truncating the original table, and re-inserting the rows you kept. That keeps the transaction log from getting hammered and your clustered index from getting fragmented, and it should be quicker too!
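That alternative could be sketched like this (table and column names follow the `#table1` example above; note that `TRUNCATE TABLE` fails if foreign keys reference the table, so this sketch assumes none do):

```sql
-- Keep one arbitrary row per code in a temp table,
-- then truncate and reload. A TRUNCATE plus re-insert
-- is typically far lighter on the log than a mass DELETE.
SELECT id, code
INTO #keep
FROM (
    SELECT id, code,
           ROW_NUMBER() OVER (PARTITION BY code ORDER BY id) AS rn
    FROM #table1
) t
WHERE rn = 1;

TRUNCATE TABLE #table1;  -- not allowed if FKs reference the table

INSERT INTO #table1 (id, code)
SELECT id, code
FROM #keep;

DROP TABLE #keep;
```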




It is actually very simple:

DELETE FROM Table1
WHERE ID NOT IN
         (SELECT MAX(ID)
          FROM Table1
          GROUP BY CODE)

2 Comments

This looks like it would work, but I would change the first line to SELECT ID, Code FROM Table1 and check that the results and count match what you're expecting from your existing query first :)
I agree and always do that, or even run SELECT * INTO #Table1Backup FROM Table1; beforehand.
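The dry run suggested in these comments could look like this (column names as in the question):

```sql
-- Preview exactly which rows the DELETE above would remove;
-- the count should match the duplicate query from the question.
SELECT ID, Code
FROM Table1
WHERE ID NOT IN
         (SELECT MAX(ID)
          FROM Table1
          GROUP BY Code);

-- Optional safety net: snapshot the table before deleting.
SELECT * INTO #Table1Backup FROM Table1;
```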

Self-join solution, with a performance test against the CTE approach.

create table codes(
    id int IDENTITY(1,1) NOT NULL,
    code int null,
    CONSTRAINT [PK_codes_id] PRIMARY KEY CLUSTERED (id ASC)
)

declare @counter int
set @counter = 1
while (@counter <= 1000000)
begin
    -- insert a random code between 0 and 999
    insert into codes(code) select ABS(CHECKSUM(NEWID()) % 1000)
    set @counter = @counter + 1
end
GO

set statistics time on;
    delete a
    from codes a
    left join (
        select MIN(id) as id
        from codes
        group by code
    ) b on a.id = b.id
    where b.id is null
set statistics time off;

--set statistics time on;
--  WITH cte AS 
--  (select a.id, a.code, ROW_NUMBER() OVER(PARTITION by code ORDER BY id) AS rn
--  from codes a
--  )
--  delete x
--  FROM codes x
--  JOIN cte ON x.id = cte.id
--  WHERE cte.rn > 1
--set statistics time off;

Performance test results:

With join:

 SQL Server Execution Times:
   CPU time = 3198 ms,  elapsed time = 3200 ms.

(999000 row(s) affected)

With CTE:

 SQL Server Execution Times:
   CPU time = 4197 ms,  elapsed time = 4229 ms.

(999000 row(s) affected)

2 Comments

Why would performance be important for a problem like this? Sounds like a one-off done by the admin.
It's important to know that CTEs are slow. Other people might be looking at this question in the future, and they might have different problems to solve.

It's basically done like this:

WITH CTE_Dup AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SalesOrderno, ItemNo
                              ORDER BY SalesOrderno, ItemNo) AS ROW_NO
    FROM dbo.SalesOrderDetails
)
DELETE FROM CTE_Dup WHERE ROW_NO > 1;

Note: the PARTITION BY must include all the fields that define a duplicate.

Here is another example:

CREATE TABLE #Table (C1 INT,C2 VARCHAR(10))

INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (2,'Oracle')

SELECT * FROM #Table

;WITH Delete_Duplicate_Row_cte
     AS (SELECT ROW_NUMBER() OVER (PARTITION BY C1, C2 ORDER BY C1, C2) AS ROW_NUM, *
         FROM #Table)
DELETE FROM Delete_Duplicate_Row_cte
WHERE ROW_NUM > 1

SELECT * FROM #Table

