How to import data into an empty SQL Server table, avoiding duplicates in the source data

Question

I am trying to import data into an empty SQL Server table, avoiding duplicates that exist in the source data.

Currently I am doing a bulk insert into a temp table, and then copying the data across using:

INSERT INTO Actual_table
SELECT * FROM Temp_table

Temp_table and Actual_table have exactly the same structure. The only difference is that the column that is the primary key on Actual_table has a UNIQUE constraint on Temp_table instead, set to ignore duplicates:

UNIQUE NONCLUSTERED (Col1) WITH (IGNORE_DUP_KEY = ON)

In other words:

Actual_table

Col1 (PK)    Col2

Temp_table

Col1 (Unique, ignore duplicates)   Col2

The Actual_table is empty when we start this process, and the duplicates to be avoided are only on the PK field (not DISTINCT on the whole row, in other words).

I have no idea if this is the best way to achieve this, and comments/suggestions would be appreciated.

Just to flesh out my thoughts further:

Should I rather import straight into the actual table, adding the IGNORE_DUP_KEY constraint before importing and then removing it afterwards (is this even possible)?
Or should I set up the Temp_table without the IGNORE_DUP_KEY constraint (which makes the bulk import faster), and then tweak the copy-across code to ignore the duplicates? If this is a good idea, could someone please show me the syntax to achieve this?

I am using SQL Server 2014.

The way you are doing it is not bad. Have you considered cleaning up the data on the front end before the import?
– paparazzo, Commented Mar 15, 2016 at 12:08

Gordon Linoff · Accepted Answer · 2016-03-15 21:36:05Z


If the table is initially empty, then you just remove the duplicates when you load:

INSERT INTO Actual_table
    SELECT DISTINCT *
    FROM Temp_table;

If you only want "distinctness" on a subset of columns, use ROW_NUMBER():

INSERT INTO Actual_table
    SELECT <col1>, <col2>, . . .
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY col ORDER BY (SELECT NULL)) as seqnum
          FROM Temp_table t
         ) t
    WHERE seqnum = 1;
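
As a concrete sketch for the two-column tables described in the question, the query above might look like the following. It assumes Col1 is the key column to deduplicate on and Col2 is the only other column; those names come from the question, and the column lists would need extending for a wider table.

```sql
-- Copy one row per Col1 value from the staging table into the real table.
-- Listing the columns explicitly keeps the helper column seqnum out of the INSERT.
INSERT INTO Actual_table (Col1, Col2)
    SELECT Col1, Col2
    FROM (SELECT t.Col1, t.Col2,
                 -- Number the rows within each Col1 group; ORDER BY (SELECT NULL)
                 -- means no particular order, so the row numbered 1 is arbitrary.
                 ROW_NUMBER() OVER (PARTITION BY t.Col1 ORDER BY (SELECT NULL)) AS seqnum
          FROM Temp_table t
         ) t
    WHERE seqnum = 1;
```

Because the ordering inside each group is unspecified, which of the duplicate rows survives is not defined. If a particular row should win, replacing (SELECT NULL) with a real column in the ORDER BY makes the choice deterministic.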

edited Mar 15, 2016 at 21:36

answered Mar 15, 2016 at 11:02


2 Comments

Alex Over a year ago

Thanks Gordon. Sorry if this is a stupid question, but do I have to add 'seqnum' as a column to Actual_table? I'm getting a 'Column name or number of supplied values does not match table definition' error.

Gordon Linoff Over a year ago

@Alex . . . No, you just need to list out all the columns. The * was a lazy short-cut that doesn't work right in this case.
