I am trying to import data into an empty SQL Server table, avoiding duplicates that exist in the source data.

Currently I am doing a bulk insert into a temp table, and then copying the data across using:

INSERT INTO Actual_table
SELECT * FROM Temp_table

Temp_table and Actual_table have exactly the same structure. The only difference is the key column: it is the PK on Actual_table, whereas on Temp_table I have given it a UNIQUE constraint that is set to ignore duplicates:

UNIQUE NONCLUSTERED (Col1) WITH (IGNORE_DUP_KEY = ON) 

In other words:

Actual_table

Col1 (PK)    Col2

Temp_table

Col1 (Unique, ignore duplicates)   Col2

The Actual_table is empty when we start this process, and the duplicates to be avoided are only on the PK field (not DISTINCT on the whole row, in other words).

I have no idea if this is the best way to achieve this, and comments/suggestions would be appreciated.

Just to flesh out my thoughts further:

  1. Should I rather import straight into the actual table, adding the IGNORE_DUP_KEY constraint before importing and then removing it afterwards (is this even possible)?
  2. Or should I leave the IGNORE_DUP_KEY constraint off Temp_table (which makes the bulk import faster) and then tweak the copy statement to skip the duplicates? If this is a good idea, could someone please show me the syntax to achieve this?

I am using SQL Server 2014.

  • The way you are doing it is not bad. Have you considered cleaning up the data on the front end before the import? Commented Mar 15, 2016 at 12:08

1 Answer

If the table is initially empty, then you just remove the duplicates when you load:

INSERT INTO Actual_table
    SELECT DISTINCT *
    FROM Temp_table;

If you only want "distinctness" on a subset of columns, use ROW_NUMBER():

INSERT INTO Actual_table
    SELECT <col1>, <col2>, . . .
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY col ORDER BY (SELECT NULL)) as seqnum
          FROM Temp_table t
         ) t
    WHERE seqnum = 1;
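
For the two-column layout shown in the question, the query above might be filled in roughly as follows. This is only a sketch: it assumes Actual_table and Temp_table have exactly the columns Col1 and Col2, and that duplicates are defined by Col1 alone. Listing the columns explicitly keeps the helper column seqnum out of the insert.

```sql
-- Sketch assuming both tables have exactly the columns Col1 and Col2
-- (names taken from the question; adjust to the real schema).
INSERT INTO Actual_table (Col1, Col2)
SELECT d.Col1, d.Col2
FROM (SELECT t.Col1,
             t.Col2,
             -- Number the rows within each Col1 value; the order is arbitrary here.
             ROW_NUMBER() OVER (PARTITION BY t.Col1 ORDER BY (SELECT NULL)) AS seqnum
      FROM Temp_table t
     ) AS d
WHERE d.seqnum = 1;  -- keep one row per Col1 value
```

Because of ORDER BY (SELECT NULL), which of the duplicate rows survives is not defined. If that matters, order by a real column instead.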

2 Comments

Thanks Gordon. Sorry if this is a stupid question, but do I have to add 'seqnum' as a column to Actual_table? I'm getting a 'Column name or number of supplied values does not match table definition' error.
@Alex . . . No, you just need to list out all the columns. The * was a lazy short-cut that doesn't work right in this case.