
UPDATE: I ended up using this method created by Johnny Bubriski and modified it a bit to skip duplicates. It works like a charm and is apparently quite fast. Link: http://johnnycode.com/2013/08/19/using-c-sharp-sqlbulkcopy-to-import-csv-data-sql-server/

I have been searching for an answer to this but cannot seem to find it. I am doing a T-SQL bulk insert to load data into a table in a local database from a csv file. My statement looks like this:

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    ERRORFILE = 'C:\ProgramData\Tools_TextileMagazine\AdditionalFiles\BulkInsertErrors.txt'
)
GO

SELECT * 
FROM Orders
GO

I get an exception when I try to insert duplicate rows (for example, loading the same csv file twice), which causes the entire insert to stop and roll back. That is understandable enough, since I am violating the primary key constraint. Right now I just show a message box to let users know that duplicates are present in the csv file, but that is of course not a proper solution, actually not a solution at all. My question is: is there any way to ignore these duplicate rows, skip them, and add only the rows that are not duplicates? Perhaps in a try/catch somehow?

If that is not possible, what would be the "correct" (for lack of a better word) way to import data from the csv file? The exception is causing me a bit of trouble. I did read somewhere that you can set up a temporary table, load the data into it, and select the distinct rows between the two tables before inserting. But is there really no easier way to do it with bulk insert?

3 Comments

  • Import the data into a temporary table, clean it, then insert it into the destination. Otherwise use SSIS to do the cleaning as part of the import flow. Commented Dec 15, 2014 at 13:19
  • St0ffer - Can you elaborate on how you modified the solution to skip duplicates? Commented Jun 29, 2020 at 15:45
  • @Ryan sorry to disappoint, but I don't remember any of this after ~6 years ;) Commented Jul 2, 2020 at 11:28

2 Answers


You could set the MAXERRORS property to quite a high value, which would allow the valid records to be inserted and the duplicates to be ignored. Unfortunately, this also means that other errors in the dataset won't cause the load to fail.
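A minimal sketch of that option, reusing the statement from the question with an arbitrary MAXERRORS value of 1000 (note the later comment on this answer reporting that MAXERRORS does not apply to constraint checks, so it may not actually skip duplicate-key rows):

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    MAXERRORS = 1000  -- tolerate up to 1000 bad rows before the whole load fails
);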

Alternatively, you could set the BATCHSIZE property, which loads the data in multiple transactions; if there are duplicates, only the affected batch is rolled back.
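A minimal sketch of the BATCHSIZE variant, assuming an arbitrary batch size of 1000 rows, so a failing batch rolls back on its own while the other batches commit:

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    BATCHSIZE = 1000  -- commit every 1000 rows; only the failing batch is rolled back
);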

A safer, but less efficient, way would be to load the CSV file into a separate, empty table and then merge the rows into your Orders table, as you mentioned. Personally, this is the way I'd do it.
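A minimal sketch of that staging-and-merge approach, assuming a hypothetical schema Orders(OrderId INT PRIMARY KEY, CustomerName NVARCHAR(100), OrderDate DATE) and an identical, initially empty staging table Orders_Staging (adapt the names and columns to your tables):

-- Load the raw CSV into the empty staging table
BULK INSERT Orders_Staging
FROM 'csvfile.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt');

-- Insert only rows whose key is not already in Orders; existing keys are left untouched.
-- Assumes the CSV itself contains no repeated keys; de-duplicate the staging table first if it might.
MERGE Orders AS target
USING Orders_Staging AS source
    ON target.OrderId = source.OrderId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderId, CustomerName, OrderDate)
    VALUES (source.OrderId, source.CustomerName, source.OrderDate);

-- Empty the staging table for the next import
TRUNCATE TABLE Orders_Staging;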

None of these solutions is ideal, but I can't think of a way of ignoring duplicates in the BULK INSERT syntax.


5 Comments

Yeah, I don't really like the idea of the inconsistency that could potentially be caused by your first two suggestions. I guess I will try and go with the dummy table.
I'll put yours as the answer since it was technically correct even though I stepped away from using bulk insert in the end.
@St0ffer what did you use instead ?
@MohammadAli See the update at the top of my question :)
Just want to clarify: I found this post after 7 years and tried using MAXERRORS to skip the duplicate rows, but it cannot. The whole BULK INSERT statement will throw an exception if there are primary key constraint violations. The official docs also state that the MAXERRORS option does not apply to constraint checks.

First of all, there is no direct solution like BULK INSERT WHERE NOT EXISTS. You can use one of the following workarounds.

When using BULK INSERT there are two scenarios:

  1. You are BULK INSERTing into an empty table
  2. You are BULK INSERTing into an already populated table

Solution for Case 1: set MAXERRORS = 0 and BATCHSIZE = the total number of rows in the csv file.

Using those settings with BULK INSERT causes the whole operation to roll back even if there is a single error, which prevents any rows from being imported when only a few rows contain errors. You would need to fix all the import errors before the operation can complete. This avoids the situation where you try to import 50 rows, 30 get imported and the rest do not, and you then have to search the CSV file for the failed rows and re-import them, or delete all of the imported rows from the SQL table and run BULK INSERT again.
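A minimal sketch of that all-or-nothing setup, reusing the statement from the question and assuming, for illustration, a CSV file of 5000 rows (replace the batch size with your actual row count):

BULK INSERT Orders
FROM 'csvfile.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR = '0x0a',
    FORMATFILE = 'formatfile.fmt',
    MAXERRORS = 0,    -- any single error aborts the load
    BATCHSIZE = 5000  -- one batch for the whole file, so it commits all or nothing
);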

Solution for Case 2: 1> Run a SELECT query on the existing table, right-click the results, and export them as CSV. If you have a spreadsheet program, paste the data you want to import below the exported data, use conditional formatting on the primary key column to highlight duplicate rows, and delete them. Then use the BULK INSERT operation.

2> Set MAXERRORS = the number of rows, and import the csv file using BULK INSERT. This is not a safe or recommended way, as there might be other kinds of errors apart from duplicate key errors.

3> Set BATCHSIZE = 1 and MAXERRORS = a high number, and import the csv file using BULK INSERT. This imports all the rows without errors, and any row with an error is skipped. It is useful if the data set is small and you can visually identify the rows that were not imported by observing table columns, such as an id number column that shows the missing numbers.

4> Right-click the existing table and choose Script Table as > CREATE To > New Query Editor Window. Simply rename the table to a staging name like table_staging. BULK INSERT into the staging table and then run a second query to copy the data from the staging table to the main table, using a WHERE clause to check whether the row/PK already exists, as in the sketch below. This is a much safer way, but it forces you to create a staging table.
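A minimal sketch of that staging workflow, again with the hypothetical Orders/Orders_Staging example schema used above (OrderId INT PRIMARY KEY, CustomerName, OrderDate); this variant also keeps a single row per key in case the CSV itself repeats a key:

-- 1. Staging table scripted from the main table and renamed (assumed example schema)
CREATE TABLE Orders_Staging (
    OrderId      INT           NOT NULL,
    CustomerName NVARCHAR(100) NULL,
    OrderDate    DATE          NULL
);

-- 2. Bulk load the CSV into the staging table
BULK INSERT Orders_Staging
FROM 'csvfile.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt');

-- 3. Copy only rows whose primary key does not already exist in the main table,
--    taking one row per key even if the CSV repeats it
INSERT INTO Orders (OrderId, CustomerName, OrderDate)
SELECT s.OrderId, s.CustomerName, s.OrderDate
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY OrderId ORDER BY OrderId) AS rn
    FROM Orders_Staging
) AS s
WHERE s.rn = 1
  AND NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.OrderId = s.OrderId);

-- 4. Clean up
DROP TABLE Orders_Staging;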

