6

I have a mongodb collection . When I do.

db.bill.find({})

I get,

{ 
    "_id" : ObjectId("55695ea145e8a960bef8b87a"),
    "name" : "ABC. Net", 
    "code" : "1-98tfv",
    "abbreviation" : "ABC",
    "bill_codes" : [  190215,  44124,  190215,  147708 ],
    "customer_name" : "abc"
}

I need an operation to remove the duplicate values from the bill_codes. Finally it should be

{ 
    "_id" : ObjectId("55695ea145e8a960bef8b87a"),
    "name" : "ABC. Net", 
    "code" : "1-98tfv",
    "abbreviation" : "ABC",
    "bill_codes" : [  190215,  44124,  147708 ],
    "customer_name" : "abc"
}

How to achieve this in mongodb.

4 Answers 4

16

Well's you can do this using the aggregation framework as follows:

collection.aggregate([
    { "$project": {
        "name": 1,
        "code": 1,
        "abbreviation": 1,
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] }
    }}
])

The $setUnion operator is a "set" operator, therefore to make a "set" then only the "unique" items are kept behind.

If you are still using a MongoDB version older than 2.6 then you would have to do this operation with $unwind and $addToSet instead:

collection.aggregate([
    { "$unwind": "$bill_codes" },
    { "$group": {
        "_id": "$_id",
        "name": { "$first": "$name" },
        "code": { "$first": "$code" },
        "abbreviation": { "$first": "$abbreviation" },
        "bill_codes": { "$addToSet": "$bill_codes" }
    }}
])

It's not as efficient but the operators are supported since version 2.2.

Of course if you actually want to modify your collection documents permanently then you can expand on this and process the updates for each document accordingly. You can retrieve a "cursor" from .aggregate(), but basically following this shell example:

db.collection.aggregate([
    { "$project": {
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] },
        "same": { "$eq": [
            { "$size": "$bill_codes" },
            { "$size": { "$setUnion": [ "$bill_codes", [] ] } }
        ]}
    }},
    { "$match": { "same": false } }
]).forEach(function(doc) {
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "bill_codes": doc.bill_codes } }
    )
})

A bit more involved for earlier versions:

db.collection.aggregate([
    { "$unwind": "$bill_codes" },
    { "$group": {
        "_id": { 
            "_id": "$_id",
            "bill_code": "$bill_codes"
        },
        "origSize": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id._id",
        "bill_codes": { "$push": "$_id.bill_code" },
        "origSize": { "$sum": "$origSize" },
        "newSize": { "$sum": 1 }
    }},
    { "$project": {
        "bill_codes": 1,
        "same": { "$eq": [ "$origSize", "$newSize" ] }
    }},
    { "$match": { "same": false } }
]).forEach(function(doc) {
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "bill_codes": doc.bill_codes } }
    )
})

With the added operations in there to compare if the "de-duplicated" array is the same as the original array length, and only return those documents that had "duplicates" removed for processing on updates.


Probably should add the "for python" note here as well. If you don't care about "identifying" the documents that contain duplicate array entries and are prepared to "blast" the whole collection with updates, then just use python .set() in the client code to remove the duplicates:

for doc in collection.find():
    collection.update(
       { "_id": doc["_id"] },
       { "$set": { "bill_codes": list(set(doc["bill_codes"])) } }
    )

So that's quite simple and it depends on which is the greater evil, the cost of finding the documents with duplicates or updating every document whether it needs it or not.

This at least covers techniques.

Sign up to request clarification or add additional context in comments.

4 Comments

This does not save back to the collection. I mean doing db.bill.find({}) again retrieves the duplicate value
@user567797 Your question did not state that you wanted to change the stored documents. This was answering in terms of "display only". You would have to process the results and update each document individually where the items are actually changed. Added an explanation of how to do this and identify the documents that had duplicates removed from the array, so you don't need to update every document in the collection.
How can I do this with the second query as I am using mongo version 2.4 .Also note that your update code has a missing comma.
@user567797 Very possible added the way to do this and corrections. MongoDB 2.4 is quite old really and you should considering upgrading at least to 2.6.x ( and you would have to before upgrading further ). There are many advantages to doing this, including "Bulk updates" to make this work even faster.
3

You can use a foreach loop with some javascript:

db.bill.find().forEach(function(entry){
     var arr = entry.bill_codes;
     var uniqueArray = arr.filter(function(elem, pos) {
        return arr.indexOf(elem) == pos;
     }); 
     entry.bill_codes = uniqueArray;
     db.bill.save(entry);
})

Comments

2

MongoDB 4.2 collection updateMany method's update parameter can also be an aggregation pipeline (instead of a document). The pipeline supports $set, $unset and $replaceWith stages. Using the $setIntersection aggregation pipeline operator with the $set stage, you can remove the duplicates from an array field and update the collection in a single operation.

An example:

arrays collection:

{ "_id" : 0, "a" : [ 3, 5, 5, 3 ] }
{ "_id" : 1, "a" : [ 1, 2, 3, 2, 4 ] }

From the mongo shell:

db.arrays.updateMany(
   {  },
   [
      { $set: { a: { $setIntersection: [ "$a", "$a" ] } } }
   ]
)

The updated arrays collection:

{ "_id" : 0, "a" : [ 3, 5 ] }
{ "_id" : 1, "a" : [ 1, 2, 3, 4 ] }

The other update methods, update(), updateOne() and findAndModify() also has this feature.

1 Comment

good answer! I tried with { $set: { a: { $setIntersection: [ "$a" ] } } } and it works perfectly (MongoDB v.4.2.3)
0

Mongo 3.4+ has $addFields aggregation stage, which allows you to avoid explicitly listing all the other fields in $project:

db.bill.aggregate([
    {"$addFields": {
        "bill_codes": {"$setUnion": ["$bill_codes", []]}
    }}
])

Just for reference, here is another (more lengthy) way that uses replaceRoot and also doesn't require listing all possible fields:

db.bill.aggregate([
    {'$unwind': {
        'path': '$bill_codes',
        // output the document even if its list of books is empty
        'preserveNullAndEmptyArrays': true
    }},
    {'$group': {
        '_id': '$_id',
        'bill_codes': {'$addToSet': '$bill_codes'},
        // arbitrary name that doesn't exist on any document
        '_other_fields': {'$first': '$$ROOT'},
    }},
    {
      // the field, in the resulting document, has the value from the last document merged for the field. (c) docs
      // so the new deduped array value will be used
      '$replaceRoot': {'newRoot': {'$mergeObjects': ['$_other_fields', "$$ROOT"]}}
    },
    {'$project': {'_other_fields': 0}}
])    

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.