I am trying to remove duplicates from a MongoDB collection, but all the solutions I have found fail. My document structure:

{
    "_id" : ObjectId("5d94ad15667591cf569e6aa4"),
    "a" : "aaa",
    "b" : "bbb",
    "c" : "ccc",
    "d" : "ddd",
    "key" : "057cea2fc37aabd4a59462d3fd28c93b"

}

The key value is md5(a+b+c+d). I already have a database with over 1 billion records. I want to remove all duplicates by key and then create a unique index, so that a record whose key is already in the database will not be inserted again.

I already tried

db.data.ensureIndex( { key:1 }, { unique:true, dropDups:true } )

But from what I understand, the dropDups option was removed in MongoDB 3.0 and later.

I also tried several JavaScript snippets, such as:

var duplicates = [];

db.data.aggregate([
  { $match: { 
    key: { "$ne": '' }  // skip documents with an empty key
  }},
  { $group: { 
    _id: { key: "$key"}, // can be grouped on multiple properties 
    dups: { "$addToSet": "$_id" }, 
    count: { "$sum": 1 } 
  }}, 
  { $match: { 
    count: { "$gt": 1 }    // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}       // Lets stages spill to disk when they exceed the memory limit
).forEach(function(doc) {
    doc.dups.shift();      // First element skipped for deleting
    doc.dups.forEach( function(dupId){ 
        duplicates.push(dupId);   // Getting all duplicate ids
        }
    )    
})

and it fails with:

QUERY [js] uncaught exception: Error: command failed: {
    "ok" : 0,
    "errmsg" : "assertion src/mongo/db/pipeline/value.cpp:1365",
    "code" : 8,
    "codeName" : "UnknownError"
} : aggregate failed

I haven't changed any MongoDB settings; I am working with the defaults.

  • You want to find all documents with a duplicate key field and put the corresponding _ids in an array; is that all the query should do? Commented Oct 16, 2019 at 10:09
  • Suppose your collection has the following documents: { "_id" : 1, "k" : 11 }, { "_id" : 2, "k" : 22 }, { "_id" : 3, "k" : 11 }, { "_id" : 4, "k" : 44 }, { "_id" : 5, "k" : 55 }, { "_id" : 6, "k" : 66 }, { "_id" : 7, "k" : 22 }, { "_id" : 8, "k" : 88 }, { "_id" : 9, "k" : 11 }. The resulting query output looks like: { "resultArr" : [ 2, 3, 1 ] }. Commented Oct 16, 2019 at 11:12
  • @prasad_ I want "resultArr" to be: [ { "_id" : 1, "k" : 11 }, { "_id" : 2, "k" : 22 }, { "_id" : 4, "k" : 44 }, { "_id" : 5, "k" : 55 }, { "_id" : 6, "k" : 66 }, { "_id" : 8, "k" : 88 } ], with all the duplicates removed. Commented Oct 16, 2019 at 11:40
  • I think we can get that, with all the duplicates removed. I will post the query in an answer, and let's see if it meets your requirement. Commented Oct 16, 2019 at 13:15

1 Answer

This is my input collection dups, with some duplicate data (k with values 11 and 22):

{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
{ "_id" : 4, "k" : 44 }
{ "_id" : 5, "k" : 55 }
{ "_id" : 6, "k" : 66 }
{ "_id" : 7, "k" : 22 }
{ "_id" : 8, "k" : 88 }
{ "_id" : 9, "k" : 11 }

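The query returns one document per distinct k value, leaving the duplicates out of the result (the collection itself is not modified):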

db.dups.aggregate([
  { $group: { 
        _id: "$k",
        dups: { "$addToSet": "$_id" }, 
        count: { "$sum": 1 } 
  }}, 
  { $project: { k: "$_id", _id: { $arrayElemAt: [ "$dups", 0 ] } } }
] )
=>
{ "k" : 88, "_id" : 8 }
{ "k" : 22, "_id" : 7 }
{ "k" : 44, "_id" : 4 }
{ "k" : 55, "_id" : 5 }
{ "k" : 66, "_id" : 6 }
{ "k" : 11, "_id" : 9 }

As you can see, the following duplicate documents are left out of the result:

{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
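
Which _id survives in the pipeline above depends on the order of the array built by $addToSet, and that order is not guaranteed. If the lowest _id should be the one kept for each k, a possible variation is to accumulate with $min instead of building an array. This is only a sketch against the nine-document sample; the names keepOnePerKey and keepId are made up for illustration.

```javascript
// Pipeline sketch: keep the smallest _id for each distinct k.
// keepOnePerKey and keepId are illustrative names, not part of the original answer.
var keepOnePerKey = [
  { $group: {
      _id: "$k",                  // one group per distinct k value
      keepId: { $min: "$_id" }    // smallest _id in the group; no array is accumulated
  }},
  { $project: { k: "$_id", _id: "$keepId" } }
];

// Intended use in the mongo shell (not executed here):
//   db.dups.aggregate(keepOnePerKey, { allowDiskUse: true })
```

On the sample data this should keep _id 1 for k 11 and _id 2 for k 22. The order of the groups in the output is still unspecified.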


Get the results in an array:

var arr = db.dups.aggregate([ ...] ).toArray()

arr now holds the array of documents:

[
        {
                "k" : 88,
                "_id" : 8
        },
        {
                "k" : 22,
                "_id" : 7
        },
        {
                "k" : 44,
                "_id" : 4
        },
        {
                "k" : 55,
                "_id" : 5
        },
        {
                "k" : 66,
                "_id" : 6
        },
        {
                "k" : 11,
                "_id" : 9
        }
]
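
The aggregation above is shown on nine documents. For a collection far larger than that, one alternative that avoids grouping everything in a single pipeline is to walk the collection in key order and delete every document whose key repeats the previous one, in batches. The following is a minimal sketch under assumptions, not a tested procedure for a billion documents: makeDuplicateCollector is a hypothetical helper, the batch size of 1000 is arbitrary, and the shell usage in the trailing comment assumes a non-unique index on the key field already exists so the sort can use it.

```javascript
// Hypothetical helper: collects the _id of every document whose key equals
// the key of the document before it. Documents must be pushed sorted by that key.
function makeDuplicateCollector(keyField, batchSize, onBatch) {
  var hasPrevious = false;
  var previousKey;
  var batch = [];
  return {
    push: function (doc) {
      if (hasPrevious && doc[keyField] === previousKey) {
        batch.push(doc._id);           // same key as the previous document: a duplicate
        if (batch.length >= batchSize) {
          onBatch(batch);              // hand a full batch to the caller
          batch = [];
        }
      } else {
        previousKey = doc[keyField];   // first document seen with this key: keep it
        hasPrevious = true;
      }
    },
    flush: function () {
      if (batch.length > 0) {          // hand over whatever is left at the end
        onBatch(batch);
        batch = [];
      }
    }
  };
}

// Intended use in the mongo shell (not executed here; assumes an index on key):
//   var collector = makeDuplicateCollector("key", 1000, function (ids) {
//     db.data.deleteMany({ _id: { $in: ids } });
//   });
//   db.data.find({}, { key: 1 }).sort({ key: 1 }).forEach(collector.push);
//   collector.flush();
//   // Once no duplicates remain, a unique index can be created:
//   //   db.data.createIndex({ key: 1 }, { unique: true })
```

The helper itself only decides which _ids are duplicates; the delete call lives in the callback, so the logic can be tried on a small copy of the data before running it against the real collection.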
5 Comments

Failed with: "errormsg": "assertion src/mongo/db/pipeline/value.cpp:1365", "code": 8, "codeName": "UnknownError"
Did the example and code I posted fail? It is not clear what failed. Please clarify your comment.
Yes, the example you gave failed with the same error I posted: QUERY [js] uncaught exception: Error: command failed: { "ok" : 0, "errmsg" : "assertion src/mongo/db/pipeline/value.cpp:1365", "code" : 8, "codeName" : "UnknownError" } : aggregate failed
I just ran the scripts in the mongo shell and there is no error. I am using MongoDB Server version 4.0.5.
The code I provided is shown with test data of 9 documents. I guess you will have to plan and figure out how to make it work with a billion documents.