I'm inserting a lot of documents in bulk with the latest node.js native driver (2.0).
My collection has an index on the URL field, and I'm bound to get duplicates out of the thousands of lines I insert. Is there a way for MongoDB to not crash when it encounters a duplicate?
Right now I'm batching records 1000 at a time, and Using insertMany. I've tried various things, including adding {continueOnError=true}. I tried inserting my records one by one, but it's just too slow, I have thousands of workers in a queue and can't really afford the delay.
Collection definition :
self.prods = db.collection('products');
self.prods.ensureIndex({url:1},{unique:true}, function() {});
Insert :
MongoProcessor.prototype._batchInsert= function(coll,items){
var self = this;
if(items.length>0){
var batch = [];
var l = items.length;
for (var i = 0; i < 999; i++) {
if(i<l){
batch.push(items.shift());
}
if(i===998){
coll.insertMany(batch, {continueOnError: true},function(err,res){
if(err) console.log(err);
if(res) console.log('Inserted products: '+res.insertedCount+' / '+batch.length);
self._batchInsert(coll,items);
});
}
}
}else{
self._terminate();
}
};
I was thinking of dropping the index before the insert, then reindexing using dropDups, but it seems a bit hacky, my workers are clustered and I have no idea what would happen if they try to insert records while another process is reindexing... Does anyone have a better idea?
Edit :
I forgot to mention one thing. The items I insert have a 'processed' field which is set to 'false'. However the items already in the db may have been processed, so the field can be 'true'. Therefore I can't upsert... Or can I select a field to be untouched by upsert?