
Say I have hundreds of thousands of records in a text file that I'd like to insert into the database every day, of which around half already exist in the database. Also, a unique row is defined by, say, 6 columns.

What is the correct way to code the insert in .NET in this particular case? The two which I'm wondering over are:

Do I SQL-insert straight away and catch the SqlException for duplicate entries? In this case, I'd be breaking the principle that exceptions should be reserved for exceptional cases, not frequent ones.

or

Do I do a SQL select first to check for the row before I insert? In this case, the database seems to check uniqueness twice: once in my select, and again automatically during the insert.

  • What are you using, ado.net/ef/stored procedure/inline sql? Commented Feb 16, 2013 at 10:54

3 Answers


Use a SQL statement that checks for the row before inserting it. Here is a simple example for a table called person with two columns, forename and surname, which together are checked for uniqueness:

/// <summary>
/// Insert a row into the person table
/// </summary>
/// <param name="connection">An open sql connection</param>
/// <param name="forename">The forename which will be inserted</param>
/// <param name="surname">The surname which will be inserted</param>
/// <returns>True if a new row was added, False otherwise</returns>
public static bool InsertPerson(SqlConnection connection, string forename, string surname)
{
    using (SqlCommand command = connection.CreateCommand())
    {
        command.CommandText =
            @"Insert into person (forename, surname)
                Select @forename, @surname
                Where not exists 
                    (
                        select 'X' 
                        from person 
                        where 
                            forename = @forename 
                            and surname=@surname
                    )";
        command.Parameters.AddWithValue("@forename", forename);
        command.Parameters.AddWithValue("@surname", surname);

        int rowsInserted = command.ExecuteNonQuery();

        // rowsInserted will be 0 if the row is already in the database
        return rowsInserted == 1;
    }
}
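For example, a caller might use the method like this. This is a minimal sketch: the connection string, database name, and sample values are placeholders, not part of the answer above.

```csharp
using System.Data.SqlClient;

// Hypothetical usage; the connection string is a placeholder.
using (var connection = new SqlConnection(
    "Server=.;Database=PeopleDb;Integrated Security=true"))
{
    connection.Open();

    // First call inserts the row and returns true.
    bool added = InsertPerson(connection, "Ada", "Lovelace");

    // Second call with the same values finds the existing row,
    // inserts nothing, and returns false.
    bool addedAgain = InsertPerson(connection, "Ada", "Lovelace");
}
```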

3 Comments

You don't want to open a connection for each insert.
The code sample is the simplest thing that will work. Plenty of optimisations are possible; my aim was to concisely demonstrate all the required concepts so that anyone who looks at this answer will be able to make use of it.
I've amended the code sample so that it takes an open connection instead of creating and opening one, as suggested by CodeCaster.

I think you should choose the exception way. Just do something like this:

foreach (var elem in elementsFromFile)
{
    try
    {
        context.SomeTable.Add(elem);
        context.SaveChanges();
    }
    catch (DbUpdateException)
    {
        // The row already exists; detach the failed entity so the
        // next SaveChanges call does not try to insert it again.
        context.Entry(elem).State = EntityState.Detached;
    }
}

One caveat: I don't like that SaveChanges runs on every iteration, but it should still perform better than the select-first approach. It will work, and work well enough.



A simple way to ignore the duplicates is to create your unique index with the option IGNORE_DUP_KEY = ON. You then won't incur the overhead of testing for duplicates or catching exceptions.

e.g.

CREATE UNIQUE NONCLUSTERED INDEX [IX_IgnoreDuplicates] ON [dbo].[Test]
(
    [Id] ASC,
    [Col1] ASC,
    [Col2] ASC
)
WITH (IGNORE_DUP_KEY = ON) 

You can then also use BULK INSERT to efficiently load all of your data with automatic duplicate removal.
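For example, combined with the IGNORE_DUP_KEY index above, a bulk load might look like this. This is a sketch: the file path and field terminators are assumptions about the input file, not given in the question.

```sql
-- Hypothetical load; the file path and terminators are placeholders.
BULK INSERT [dbo].[Test]
FROM 'C:\data\daily_records.txt'
WITH
(
    FIELDTERMINATOR = '\t',  -- assuming a tab-separated file
    ROWTERMINATOR   = '\n'
);
-- Rows that collide on the unique index key are skipped with the
-- warning "Duplicate key was ignored." instead of failing the load.
```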

See CREATE INDEX

