Matching a field to external array values with Aggregation

Question

I have a query that looks like this so far:

 db.variants.aggregate({'$unwind':'$samples'},{$project:{"_id":0,"samples":1,"chr":1,"pos":1,"ref":1,"alt":1}});
    { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}
    { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}
    { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}
    { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}
    { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}

But outside the query, I have defined samples groups like this:

 SID1=['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9']
 SID2=['559de1b2aa43f']

But I'm stuck when it comes to how to do the actual grouping because my sample_id arrays are not actually in the document, so the normal $group operator won't work. I want to create groups called SID1 and SID2 that sums the 'samples.GTC' if the samples.study_id is in the SID1 array (and the same for SID2). My expected output is:

 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", SID1:1, SID2:2}
 { "chr" : "22", "pos" : 14371, "ref" : "A", "alt" : "G", SID1:1, SID2:0}

I'm guessing it should be somewhat close to this, but obviously not quite:

db.variants.aggregate(
 {'$unwind':'$samples'},
 {$project:
   {"_id":0,"chr":1,"pos":1,"ref":1,"alt":1,"samples":1}
 },
 { "$group":{
    "_id":{"chr":"$chr","pos":"$pos","ref":"$ref","alt":"$alt"},
    "SID1" : {$cond:{if:{"$sample_id":{$in:SID1}},then:{$sum:"$GTC"},else:0}},
    "SID2" : {$cond:{if:{"$sample_id":{$in:SID2}},then:{$sum:"$GTC"},else:0}},
    }
 });

Blakes Seven · Accepted Answer · 2015-07-23 23:24:02Z

You were right in looking towards $cond for this, but the syntax was a little the wrong way around and there are some other helpers you need here as well:

var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

db.variants.aggregate([
    { "$unwind": "$samples" },
    { "$group": {
        "_id": {
           "chr": "$chr",
           "pos": "$pos",
           "ref": "$ref",
           "alt": "$alt"
        },
        "SID1": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": { 
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID1
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        },
        "SID2": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": { 
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID2
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        }
    }}
])

And that gives the result:

{
    "_id" : {
            "chr" : "20",
            "pos" : 14371,
            "ref" : "A",
            "alt" : "G"
    },
    "SID1" : 1,
    "SID2" : 0
}
{
    "_id" : {
            "chr" : "22",
            "pos" : 14373,
            "ref" : "C",
            "alt" : "T"
    },
    "SID1" : 1,
    "SID2" : 2
}

So the $cond goes "inside" the $sum since that is an "accumulator" and therefore how you you structure under $group.

There is nothing wrong with using a variable name directly when defining a pipeline, as the value will just "interpolate" and be treated as a literal. But of course, since these are "arrays" to neet to compare them as such. More to the point, they are actually "sets".

The $setIsSubset operator is the one that can "logically" compare two "sets" in order to see if one contains the elements of the other. That gives a logical true/false for the $cond to work with.

However the "samples.sample_id" field is not an array. But we can simply "make it into one" by using the $map operator feeding it a $literal array declared as a single element and transpose the value.

The $map operator does just the same thing as the function of the same name in many programming languages, where it acts on an array as it's "input". It processes each array element as a declared variable from "as" by processing a functional expression from "in". It returns an array of the same length as the input, but with results as applied by the functional expression. As another example:

{ "$map": {
    "input": { "$literal": [1,2,3,4] },   // input array
    "as": "el",                           // variable represents element
    "in": {                               
        "$multiply": [ "$$el", "$$el" ]   // square of element
    }
}

Returns:

[1,4,9,16]                                // All array elements "squared"

The $literal operator has actually been around since MongoDB 2.2 with the introduction of the aggregation framework, but was the undocumented operator $const. Whilst it was mentioned earlier that there is nothing wrong with "injecting" an external variable into the aggregation pipeline as is shown, the one thing you cannot do is "return" that value as a property of a document. As an expression argument this is fine in most cases, but for instance you cannot do this:

{ "$project": {
    "myfield": ["bill","ted","fred"]
}}

Which would cause an error, so instead you do:

{ "$project": {
    "myfield": { "$literal": ["bill","ted","fred"] }
}}

Which allows the field to be set as what you want it to be, an array of values.

Therefore in combination with $map in the listing it is just a way of representing an array of a single element that does not exist in the pipeline in order to "tranpose" it's value with the the current field.

It turns this:

"559de1b2aa43f47656b2a3fa"

Via the code:

{ "$map": {
    "input": { "$literal": ["A"] },
    "as": "el",
    "in": "$samples.sample_id"           // into this ["559de1b2aa43f47656b2a3fa"]
}}

This makes the $setIsSubset operation look like this internally:

{ "$setIsSubset": [ 
    ["559de1b2aa43f47656b2a3fa"],
    ["559de1b2aa43f47656b2a3fa","559de1b2aa43f47656b2a3f9"]
}}                                                          // true

The end result is each variable gets compared to see if the value contained matches one of their elements, and the appropriate "field value" is sent to $sum for accumulation.

Also, drop the $project as this generally is taken care of by the $group stage and leaving it there causes overhead in processing by needing to cycle through every document in the pipeline first. So it isn't really optimizing anything, but costing you instead.

BTW. Your sample data from pipeline output so far is missing a closing "brace". I used this data below ( without the $unwind in the pipeline of course )

 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"} },
 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}},
 { "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}},
 { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}},
 { "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}}

I was unaware of $literal and $setIsSubset, so I'll have to look into those a bit more. Would you mind elaborating more on the ["A"] and "el" pieces in there? The mongodb docs haven't helped my try to understand your response. Are they just dummy placeholders so that the expression concludes?
@StevenHart Added an expanded descriptions that hopefully adds to what the manual pages that are linked on each operator has to say.

Collectives™ on Stack Overflow

Matching a field to external array values with Aggregation

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related