You were right to look towards $cond for this, but the syntax was a little the wrong way around, and there are some other helpers you need here as well:
var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

db.variants.aggregate([
    { "$unwind": "$samples" },
    { "$group": {
        "_id": {
            "chr": "$chr",
            "pos": "$pos",
            "ref": "$ref",
            "alt": "$alt"
        },
        "SID1": {
            "$sum": {
                "$cond": [
                    // wrap the single sample_id in an array and test it against the list
                    { "$setIsSubset": [
                        { "$map": {
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID1
                    ]},
                    "$samples.GTC",   // add GTC when the id is in the list
                    0                 // otherwise add nothing
                ]
            }
        },
        "SID2": {
            "$sum": {
                "$cond": [
                    { "$setIsSubset": [
                        { "$map": {
                            "input": { "$literal": ["A"] },
                            "as": "el",
                            "in": "$samples.sample_id"
                        }},
                        SID2
                    ]},
                    "$samples.GTC",
                    0
                ]
            }
        }
    }}
])
And that gives the result:
{
    "_id" : {
        "chr" : "20",
        "pos" : 14371,
        "ref" : "A",
        "alt" : "G"
    },
    "SID1" : 1,
    "SID2" : 0
}
{
    "_id" : {
        "chr" : "22",
        "pos" : 14373,
        "ref" : "C",
        "alt" : "T"
    },
    "SID1" : 1,
    "SID2" : 2
}
So the $cond goes "inside" the $sum, since $sum is an "accumulator" and that is how you structure things under $group.
There is nothing wrong with using a variable name directly when defining a pipeline, as the value will just "interpolate" and be treated as a literal. But of course, since these are "arrays", you need to compare them as such. More to the point, they are actually "sets".
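Because the pipeline is an ordinary JavaScript object, the array held in the variable is simply embedded at the point where its name appears. A minimal sketch of that idea follows; condSum is a hypothetical helper (not part of MongoDB or the shell) that just avoids repeating the accumulator for each list:

```javascript
// Hypothetical helper: builds the conditional $sum accumulator for one id list.
function condSum(ids) {
  return {
    "$sum": {
      "$cond": [
        { "$setIsSubset": [
          { "$map": {
            "input": { "$literal": ["A"] },
            "as": "el",
            "in": "$samples.sample_id"
          }},
          ids // the array value itself is embedded here
        ]},
        "$samples.GTC",
        0
      ]
    }
  };
}

var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

var group = {
  "$group": {
    "_id": { "chr": "$chr", "pos": "$pos", "ref": "$ref", "alt": "$alt" },
    "SID1": condSum(SID1),
    "SID2": condSum(SID2)
  }
};

// The serialized stage contains the id values, not the variable names.
var serialized = JSON.stringify(group);
```

Passing [ { "$unwind": "$samples" }, group ] to db.variants.aggregate() should then be equivalent to the listing above.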
The $setIsSubset operator is the one that can "logically" compare two "sets" in order to see if every element of the first is also contained in the second. That gives a logical true/false for the $cond to work with.
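For intuition only, the same test can be written in plain JavaScript. The isSubset function below is a hypothetical stand-in, not how the server implements the operator, and it only compares primitive values, which is adequate for the string ids used here:

```javascript
// Hypothetical stand-in for $setIsSubset:
// true when every element of `a` is also present in `b`.
function isSubset(a, b) {
  var lookup = new Set(b);
  return a.every(function (el) { return lookup.has(el); });
}

var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'];

var listed    = isSubset(['559de1b2aa43f47656b2a3fa'], SID1); // true, the id is in SID1
var notListed = isSubset(['559de1b2aa43f'], SID1);            // false, the id is not in SID1
```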
However the "samples.sample_id" field is not an array. But we can simply "make it into one" by using the $map operator, feeding it a $literal array declared with a single element and transposing the field value in its place.
The $map operator does just the same thing as the function of the same name in many programming languages, where it acts on an array as its "input". It processes each array element as a declared variable from "as" by processing a functional expression from "in". It returns an array of the same length as the input, but with results as applied by the functional expression. As another example:
{ "$map": {
    "input": { "$literal": [1,2,3,4] },   // input array
    "as": "el",                           // variable represents element
    "in": {
        "$multiply": [ "$$el", "$$el" ]   // square of element
    }
}}
Returns:
[1,4,9,16] // All array elements "squared"
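For comparison, a plain JavaScript equivalent uses Array.prototype.map, where the callback parameter plays the role of "as" and the returned expression the role of "in":

```javascript
var input = [1, 2, 3, 4];

// `el` corresponds to "as"; the returned expression corresponds to "in"
var squared = input.map(function (el) {
  return el * el;
});
// squared is [1, 4, 9, 16]
```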
The $literal operator has actually been around since MongoDB 2.2 with the introduction of the aggregation framework, but at that time only as the undocumented operator $const. Whilst it was mentioned earlier that there is nothing wrong with "injecting" an external variable into the aggregation pipeline as shown, the one thing you cannot do is "return" that value as a property of a document. As an expression argument this is fine in most cases, but for instance you cannot do this:
{ "$project": {
    "myfield": ["bill","ted","fred"]
}}
Which would cause an error, so instead you do:
{ "$project": {
    "myfield": { "$literal": ["bill","ted","fred"] }
}}
Which allows the field to be set as what you want it to be, an array of values.
Therefore, in combination with $map in the listing, it is just a way of representing an array of a single element that does not exist in the pipeline, in order to "transpose" its value with the current field.
It turns this:
"559de1b2aa43f47656b2a3fa"
Via the code:
{ "$map": {
    "input": { "$literal": ["A"] },
    "as": "el",
    "in": "$samples.sample_id"   // into this ["559de1b2aa43f47656b2a3fa"]
}}
This makes the $setIsSubset operation look like this internally:
{ "$setIsSubset": [
    ["559de1b2aa43f47656b2a3fa"],
    ["559de1b2aa43f47656b2a3fa","559de1b2aa43f47656b2a3f9"]
]} // true
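The wrapping step can be mimicked in plain JavaScript to see why the placeholder "A" never shows up in the output. This is only an analogy for how $map behaves here, with doc standing in for the current pipeline document:

```javascript
// Stand-in for one unwound pipeline document
var doc = { samples: { sample_id: "559de1b2aa43f47656b2a3fa", GTC: 0 } };

// One placeholder element in, one element out. `el` is never read,
// so the callback returns the field value in its place.
var wrapped = ["A"].map(function (el) {
  return doc.samples.sample_id;
});
// wrapped is ["559de1b2aa43f47656b2a3fa"]
```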
The end result is that the sample id of each document is compared against each variable to see if it matches one of its elements, and the appropriate "field value" (the GTC value on a match, otherwise 0) is sent to $sum for accumulation.
Also, drop the $project as this generally is taken care of by the $group stage and leaving it there causes overhead in processing by needing to cycle through every document in the pipeline first. So it isn't really optimizing anything, but costing you instead.
BTW, your sample data from the pipeline output so far is missing a closing "brace". I used the data below (without the $unwind in the pipeline, of course):
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 34, 1 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"} },
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1|0", "GQ" : 15, "DP" : 8, "HQ" : [ 5, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}},
{ "chr" : "22", "pos" : 14373, "ref" : "C", "alt" : "T", "samples" : { "GT" : "1/1", "GQ" : 43, "DP" : 5, "HQ" : [ 0, 2 ], "GTC" : 2, "sample_id" : "559de1b2aa43f"}},
{ "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "0|0", "GQ" : 48, "DP" : 1, "HQ" : [ 51, 51 ], "GTC" : 0, "sample_id" : "559de1b2aa43f47656b2a3fa"}},
{ "chr" : "20", "pos" : 14371, "ref" : "A", "alt" : "G", "samples" : { "GT" : "1|0", "GQ" : 48, "DP" : 8, "HQ" : [ 51, 51 ], "GTC" : 1, "sample_id" : "559de1b2aa43f47656b2a3f9"}}
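As a sanity check of the expected totals, the conditional sum can be reproduced without a database. This is a plain JavaScript sketch over the documents above, trimmed to the fields the pipeline uses; sumBySet is a hypothetical helper, and includes stands in for the $map / $setIsSubset combination:

```javascript
var SID1 = ['559de1b2aa43f47656b2a3fa','559de1b2aa43f47656b2a3f9'],
    SID2 = ['559de1b2aa43f'];

var docs = [
  { chr: "22", pos: 14373, ref: "C", alt: "T", samples: { GTC: 0, sample_id: "559de1b2aa43f47656b2a3fa" } },
  { chr: "22", pos: 14373, ref: "C", alt: "T", samples: { GTC: 1, sample_id: "559de1b2aa43f47656b2a3f9" } },
  { chr: "22", pos: 14373, ref: "C", alt: "T", samples: { GTC: 2, sample_id: "559de1b2aa43f" } },
  { chr: "20", pos: 14371, ref: "A", alt: "G", samples: { GTC: 0, sample_id: "559de1b2aa43f47656b2a3fa" } },
  { chr: "20", pos: 14371, ref: "A", alt: "G", samples: { GTC: 1, sample_id: "559de1b2aa43f47656b2a3f9" } }
];

// Hypothetical helper: groups by chr/pos/ref/alt and adds GTC to a total
// only when the document's sample_id appears in the matching id list.
function sumBySet(docs) {
  var groups = {};
  docs.forEach(function (d) {
    var key = [d.chr, d.pos, d.ref, d.alt].join("|");
    var g = groups[key] || (groups[key] = {
      _id: { chr: d.chr, pos: d.pos, ref: d.ref, alt: d.alt },
      SID1: 0,
      SID2: 0
    });
    g.SID1 += SID1.includes(d.samples.sample_id) ? d.samples.GTC : 0;
    g.SID2 += SID2.includes(d.samples.sample_id) ? d.samples.GTC : 0;
  });
  return Object.keys(groups).map(function (k) { return groups[k]; });
}

var totals = sumBySet(docs);
```

With these five documents the totals come out as SID1 = 1, SID2 = 2 for position 14373 and SID1 = 1, SID2 = 0 for position 14371, which matches the aggregation output shown earlier.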