
I'm inserting a lot of documents in bulk with the latest Node.js native driver (2.0).

My collection has a unique index on the URL field, and I'm bound to get duplicates among the thousands of lines I insert. Is there a way for MongoDB not to fail the whole batch when it encounters a duplicate?

Right now I'm batching records 1000 at a time and using insertMany. I've tried various things, including adding {continueOnError: true}. I tried inserting my records one by one, but it's far too slow: I have thousands of workers in a queue and can't really afford the delay.

Collection definition:

self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});

Insert:

MongoProcessor.prototype._batchInsert = function(coll, items) {
    var self = this;
    if (items.length > 0) {
        // Take up to 999 items off the front of the queue for this batch
        var batch = items.splice(0, 999);
        coll.insertMany(batch, {continueOnError: true}, function(err, res) {
            if (err) console.log(err);
            if (res) console.log('Inserted products: ' + res.insertedCount + ' / ' + batch.length);
            // Recurse on whatever is left in the queue
            self._batchInsert(coll, items);
        });
    } else {
        self._terminate();
    }
};

I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky. My workers are clustered, and I have no idea what would happen if they tried to insert records while another process is reindexing... Does anyone have a better idea?

Edit:

I forgot to mention one thing. The items I insert have a 'processed' field, which is set to 'false'. However, the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be left untouched by an upsert?
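
For what it's worth, MongoDB 2.4+ has a $setOnInsert update operator that seems made for this constraint: its fields are only applied when the upsert actually inserts a new document, so a matched existing document (and its 'processed' flag) is left untouched. A minimal sketch of that approach, reusing self.prods and the url index from above:

// Sketch only: $setOnInsert takes effect only when the upsert inserts;
// if a document with this url already exists, nothing about it changes.
item.processed = false; // new items always start out unprocessed
self.prods.updateOne(
    {url: item.url},
    {$setOnInsert: item},
    {upsert: true},
    function(err) {
        if (err) console.log(err);
    }
);

(This is per-item for clarity; the same upserts can be queued in batches with the bulk API discussed in the answer below.)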

  • I think you're looking for batch upserts. Commented Oct 31, 2014 at 6:58
  • That's the problem: I can't upsert. The items already in the collection have a field 'processed' which can be true or false, whereas the ones I insert will always be 'false'. Commented Oct 31, 2014 at 11:18

1 Answer


The 2.6 Bulk API is what you're looking for; it requires MongoDB 2.6+* and node driver 1.4+.

There are 2 types of bulk operations:

  1. Ordered bulk operations. These execute every operation in order and error out on the first write error.
  2. Unordered bulk operations. These execute all operations in parallel and aggregate the errors. Unordered bulk operations do not guarantee order of execution.

So in your case, unordered is what you want. The Bulk API docs provide this example:

var MongoClient = require('mongodb').MongoClient;

MongoClient.connect("mongodb://localhost:27017/test", function(err, db) {
  // Get the collection
  var col = db.collection('batch_write_ordered_ops');
  // Initialize the unordered batch
  var batch = col.initializeUnorderedBulkOp();

  // Add some operations (execution order is not guaranteed)
  batch.insert({a:1});
  batch.find({a:1}).updateOne({$set: {b:1}});
  batch.find({a:2}).upsert().updateOne({$set: {b:2}});
  batch.insert({a:3});
  batch.find({a:3}).remove({a:3});

  // Execute the operations
  batch.execute(function(err, result) {
    console.dir(err);
    console.dir(result);
    db.close();
  });
});

*The docs do state that: "for older servers than 2.6 the API will downconvert the operations. However it’s not possible to downconvert 100% so there might be slight edge cases where it cannot correctly report the right numbers."
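
Adapted to the duplicate-URL case in the question, a minimal sketch might look like this (col and items stand in for your collection and one batch of documents; 11000 is MongoDB's duplicate-key error code):

var batch = col.initializeUnorderedBulkOp();

// Queue one insert per document; with an unordered batch, duplicate-key
// failures are collected as write errors instead of aborting the rest.
items.forEach(function(item) {
    batch.insert(item);
});

batch.execute(function(err, result) {
    if (err) console.log(err); // connection-level failure
    if (result && result.hasWriteErrors()) {
        // Code 11000 is a duplicate key; anything else deserves a closer look
        var dups = result.getWriteErrors().filter(function(e) {
            return e.code === 11000;
        });
        console.log('Skipped ' + dups.length + ' duplicates');
    }
    if (result) console.log('Inserted: ' + result.nInserted);
});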


4 Comments

Interesting, especially the aggregation of errors. I'll give it a try, thanks! How about performance? I insert 10k records at a time on average.
It will probably perform similarly to the way you were doing batches before. Optimal batch size is going to depend on document size and your specific environment, so I'd suggest you experiment with it. Could be anywhere from 100 to 5000 per batch.
That did the trick wonderfully! I haven't played around with the batch size yet; there's no real need for it. I'm inserting anywhere from 100 to 35k records without any problems or noticeable slowdown. Quick question not worthy of opening a new thread: do you know where the docs for the err and res objects are? I can't find them anywhere...
BulkWriteResult should be something like this: docs.mongodb.org/manual/reference/method/BulkWriteResult. Here's the docs for 2.0: mongodb.github.io/node-mongodb-native/2.0/api/…
