
I'm using ASP.NET MVC4 with Entity Framework Code First. I have a table called "users", with primary key "UserId". This table may have 200,000+ entries.

I need to insert another 50 users. I might do this like

foreach(User user in NewUsers){
    dbcontext.Users.Add(user);
}
dbcontext.SaveChanges();

The problem is, one or more of those new users might already exist in the DB. If I add them and then try to save, it throws an error and none of the valid ones get added. I could modify the code to do this:

foreach(User user in NewUsers){
    if(dbcontext.Users.FirstOrDefault(u => u.UserId == user.UserId) == null)
    {
        dbcontext.Users.Add(user);
    }
}
dbcontext.SaveChanges();

which would work. The problem is, then it has to run a query 50 times on a 200,000+ entry table. So my question is, what is the most performance efficient method of inserting these users, ignoring any duplicates?

3 Comments

  • perhaps context.AddOrUpdate(user); is what you want Commented Aug 7, 2013 at 20:46
  • @O.O It's not actually users, it's different data that's being pulled from an API. The API may or may not provide the same data in multiple sequential calls. I just used "Users" because it was the first example that I thought of. Commented Aug 7, 2013 at 20:56
  • I had the same problem and didn't find an adequate solution. There are valid scenarios for this, such as when doing a bulk import from CSV where the email address needs to be unique in an existing database. Reading all existing keys into memory doesn't seem good for performance, neither is adding each entry separately. It seems that what is needed is something like INSERT IGNORE. Commented Oct 10, 2013 at 11:28

6 Answers


You can do this:

var newUserIDs = NewUsers.Select(u => u.UserId).Distinct().ToArray();
var usersInDb = dbcontext.Users.Where(u => newUserIDs.Contains(u.UserId))
                               .Select(u => u.UserId).ToArray();
var usersNotInDb = NewUsers.Where(u => !usersInDb.Contains(u.UserId));
foreach(User user in usersNotInDb){
    dbcontext.Users.Add(user);
}

dbcontext.SaveChanges();

This will execute a single query in your database to find users which already exist, then filter them out of your NewUsers set.
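As the comments below note, the in-memory `!usersInDb.Contains(...)` scan can also use a `HashSet` for O(1) lookups. A minimal sketch of just that local filtering step, using plain collections to stand in for the EF query results (`User` here is a stub, not the OP's actual entity):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class User { public int UserId { get; set; } }

public static class DuplicateFilter
{
    // existingIds is what the single database query returned; a HashSet
    // makes each membership check O(1) instead of scanning an array.
    public static List<User> ExceptExisting(IEnumerable<User> newUsers,
                                            IEnumerable<int> existingIds)
    {
        var existing = new HashSet<int>(existingIds);
        return newUsers.Where(u => !existing.Contains(u.UserId)).ToList();
    }
}
```

For ~50 users the difference is negligible; the database round trip dominates either way.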


6 Comments

For large collections, would it be better to use HashSets instead of .ToArray()?
@tbmsu For entity framework calls, it probably wouldn't have any performance effect because the Contains actually gets translated to a SQL IN clause so it's really dependent on DB performance more than anything else. Also note that for large data sets, this probably wouldn't work at all because there's a limit to the number of values you can use in an IN clause (ref).
Could you please clarify what will execute in a single query? Would the savechanges() function cause a single query or the entire snippet will?
@Nexaspx context.Add(user) marks each user entity for insertion, but nothing is actually executed against your database until you run SaveChanges().
@MohammadKhodabandeh Not really. It only selects the subset of UserId's that are in the list of NewUsers. Sure if you're doing something like trying to merge millions of records, there are better solutions, but OP suggested NewUsers had ~50 items, so the maximum size of the result set in usersInDb would also be ~50.

Since this is your primary key, your options are limited. If this wasn't your primary key, and just a unique index, assuming SQL Server, you could set up your unique key to ignore duplicates.

What I might suggest is to simply wrap a try/catch around the Add and eat the exception if the exception is a duplicate key error.

You might also see if your object supports the AddOrUpdate() method. I know this is supported in Code First implementations. I believe it will insert the row if it's new, or update it if it already exists. However, this might still involve a trip to the DB to see if the user already exists in order to know whether to do an add or an update. And, in some cases, you might not want to actually perform an update.

I think if it were me, I'd go the Try/Catch route.
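The shape of that route can be sketched independently of EF, with a delegate standing in for the actual insert-and-save and `DuplicateKeyException` as an illustrative placeholder (with SQL Server you would catch SqlException and check for error numbers 2627/2601):

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for a duplicate-key failure; in a real EF app
// this would be a SqlException (error 2627/2601) or DbUpdateException.
public class DuplicateKeyException : Exception { }

public static class BulkInsert
{
    // Attempts each insert, eating only duplicate-key failures.
    // Returns the number of rows actually inserted.
    public static int InsertIgnoringDuplicates<T>(IEnumerable<T> items, Action<T> insert)
    {
        int inserted = 0;
        foreach (var item in items)
        {
            try { insert(item); inserted++; }
            catch (DuplicateKeyException) { /* duplicate: skip it */ }
        }
        return inserted;
    }
}
```

The trade-off raised in the comments below still applies: saving per item means more round trips than one batched SaveChanges(), so it's worth measuring against the batch-filter approach.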

2 Comments

So you're saying put the SaveChanges() with a try/catch around it in the for-loop. Would this not add a lot of overhead from having 50 different SaveChanges() calls and their corresponding transactions and write operations?
@Jordan - Regarding the transactions, there should only be one transaction. You shouldn't have multiple transactions occurring here. Regarding the overhead, yes, there will be some. But, will it be greater than making 50 (or whatever) separate calls to see if each user exists first? The best way to find out is to test it and see. Personally, I think I'd rather take the in-memory overhead of using Try/Catch than making repeated DB calls. But it's going to depend on how many users you typically enter at one time.

You can filter out the existing users with one query

foreach(User user in NewUsers.Where(us => !dbcontext.Users.Any(u => u.UserId == us.UserId)))
{
    dbcontext.Users.Add(user);
}
dbcontext.SaveChanges();

EDIT:

As pointed out in the comments, the proposal above will result in an SQL call for each element in the NewUsers collection. I confirmed this with SQL Server Profiler.

One interesting result of the profiling is the somewhat weird SQL generated by EF for each item (model names differ from the OP, but the query is the same):

exec sp_executesql N'SELECT 
CASE WHEN ( EXISTS (SELECT 
    1 AS [C1]
    FROM [dbo].[EventGroup] AS [Extent1]
    WHERE [Extent1].[EventGroupID] = @p__linq__0
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT 
    1 AS [C1]
    FROM [dbo].[EventGroup] AS [Extent2]
    WHERE [Extent2].[EventGroupID] = @p__linq__0
)) THEN cast(0 as bit) END AS [C1]
FROM  ( SELECT 1 AS X ) AS [SingleRowTable1]',N'@p__linq__0 int',@p__linq__0=10

Quite a nice piece of code to do the job of a simple one-liner.

In my view, writing clean, readable, declarative code and letting the compiler and optimizer do the dirty work is generally the right attitude. This is one of the cases where the result of that style is surprising and you have to get your hands dirty.

7 Comments

Works perfectly, and only needs to run once (the Where query runs one time over the entire table, then the foreach iterates through the results). I like it.
@p.s.w.g Nice catch, I edited the answer to fix the syntax errors I noticed.
@Jordan No it doesn't if NewUsers is an in-memory collection. In that case it has to loop through every item and evaluate the Any on each one. This doesn't really improve on your original code, it just rearranges it in a more aesthetically pleasing manner.
Oh, I see what you're saying. Hmm.
Well, I can't defend my answer right now; I'll run it tomorrow and either confirm the objections of @p.s.w.g and delete the answer, or leave it alone

You can just filter out the failures and keep trying until it succeeds or you get another sort of exception:

public partial class YourEntities : DbContext
{
    public override int SaveChanges()
    {
        while (true)
        {
            try
            {
                return base.SaveChanges();
            }
            catch (DbUpdateException ex)
            {
                foreach (var entry in ex.Entries)
                {
                    // detach the failed entity to remove it from the
                    // context before the next attempt
                    entry.State = EntityState.Detached;
                }
            }
        }
    }
}

You might want to put in some extra logic to avoid an infinite loop.
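One way to bound the loop, sketched with a delegate in place of base.SaveChanges() so the retry logic stands alone (`SaveWithRetry` is an illustrative helper, not an EF API):

```csharp
using System;

public static class RetrySave
{
    // Calls save() up to maxAttempts times. onConflict gets a chance to
    // remove the offending entries (as the DbUpdateException handler above
    // does by detaching them) before the next attempt; after the final
    // attempt the exception propagates instead of looping forever.
    public static int SaveWithRetry(Func<int> save, Action<Exception> onConflict,
                                    int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try { return save(); }
            catch (Exception ex)
            {
                if (attempt >= maxAttempts) throw;
                onConflict(ex);
            }
        }
    }
}
```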

2 Comments

Wouldn't this result in entities that should be inserted (non-duplicates) not being inserted?
@Sean No, I think the way it works is that it removes only the problematic ones and keeps retrying until it succeeds. It's been a while since I looked at this, but that's my understanding.

I've found a package https://github.com/artiomchi/FlexLabs.Upsert in the comments for the issue of this feature on the EF Core repo: https://github.com/dotnet/efcore/issues/16949#issuecomment-911120274

By using its NoUpdate(), which maps to the INSERT IGNORE syntax:

db.Set<Users>().UpsertRange(NewUsers).NoUpdate().Run()

you can just insert entities without handling duplicate errors, and without doing a SELECT ... FOR UPDATE to filter entities that already exist in the DB out of your entities to be inserted.

6 Comments

So does this prevent one query per user? I don't think so. Either way, you have to address the issue in the question. Offering an alternative upsert solution isn't what was asked for and isn't enough.
From the question: So my question is, what is the most performance efficient method of inserting these users, ignoring any duplicates?, I think INSERT IGNORE will be the most convenient (just add IGNORE in the raw SQL, for this library it will be like db.Set<Users>().UpsertRange(NewUsers).NoUpdate().Run()) and fastest(no need to try SELECT all existing records from DB) approach.
Please don't add new info in comments. If this library deals with individual upserts it's highly unlikely that it is faster, whichever setting you use. You have to test if it doesn't execute one query per item to check if it exists.
It doesn't have to. The INSERT IGNORE syntax is handled server-side by the database; its design goal is to let the client easily insert many records that might duplicate existing ones into a table without manually distinguishing which records already exist.
And I'm currently using this library; in the MySQL general log I see statements like INSERT IGNORE INTO table (fields) VALUES (entity1,entity2,entity3).

The following extension method will allow you to insert records of any type while ignoring duplicates:

public static void AddRangeIgnore(this DbSet dbSet, IEnumerable<object> entities)
{
    var entitiesList = entities.ToList();
    var firstEntity = entitiesList.FirstOrDefault();

    if (firstEntity == null || !firstEntity.HasKey() || firstEntity.HasIdentityKey())
    {
        dbSet.AddRange(entitiesList);
        return;
    }

    var uniqueEntities = new List<object>();

    using (var dbContext = _dataService.CreateDbContext())
    {
        var uniqueDbSet = dbContext.Set(entitiesList.First().GetType());

        foreach (object entity in entitiesList)
        {
            var keyValues = entity.GetKeyValues();
            var existingEntity = uniqueDbSet.Find(keyValues);

            if (existingEntity == null)
            {
                uniqueEntities.Add(entity);
                uniqueDbSet.Attach(entity);
            }
        }
    }

    dbSet.AddRange(uniqueEntities);
}

public static object[] GetKeyValues(this object entity)
{
    using (var dbContext = _dataService.CreateDbContext())
    {
        var entityType = entity.GetType();
        dbContext.Set(entityType).Attach(entity);
        var objectStateEntry = ((IObjectContextAdapter)dbContext).ObjectContext.ObjectStateManager.GetObjectStateEntry(entity);
        var value = objectStateEntry.EntityKey
                                    .EntityKeyValues
                                    .Select(kv => kv.Value)
                                    .ToArray();
        return value;
    }
}

public static bool HasKey(this object entity)
{
    using (var dbContext = _dataService.CreateDbContext())
    {
        var entityType = entity.GetType();
        dbContext.Set(entityType).Attach(entity);
        var objectStateEntry = ((IObjectContextAdapter)dbContext).ObjectContext.ObjectStateManager.GetObjectStateEntry(entity);
        return objectStateEntry.EntityKey != null;
    }
}

public static bool HasIdentityKey(this object entity)
{
    using (var dbContext = _dataService.CreateDbContext())
    {
        var entityType = entity.GetType();
        dbContext.Set(entityType).Attach(entity);
        var objectStateEntry = ((IObjectContextAdapter)dbContext).ObjectContext.ObjectStateManager.GetObjectStateEntry(entity);
        var keyPropertyName = objectStateEntry.EntityKey
                                    .EntityKeyValues
                                    .Select(kv => kv.Key)
                                    .FirstOrDefault();

        if (keyPropertyName == null)
        {
            return false;
        }

        var keyProperty = entityType.GetProperty(keyPropertyName);
        var attribute = (DatabaseGeneratedAttribute)Attribute.GetCustomAttribute(keyProperty, typeof(DatabaseGeneratedAttribute));
        return attribute != null && attribute.DatabaseGeneratedOption == DatabaseGeneratedOption.Identity;
    }
}

1 Comment

What is _dataService?
