Here is my document structure:

{
 "_id" : ObjectId("50dcd7ff4de274a2c4a31df0"),
 "seq_name" : "169:D18M6ACXX:1:1111:17898:82486:GTGACA_10",
 "raw_seq" : "TTGACCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCAGTAGTCAACGGGAGTTAGACTTCTCGCACAGTAATAAACAGCCGTGTCCTCGGCTCTCAGGCTGTTCATTTGCAGA",
 "seq_aa" : "LQMNSLRAEDTAVYYCARSLTPVDYWGQGTLVTVSSGQ",
 "cdr3_seq" : "GCGAGAAGTCTAACTCCCGTTGACTAC",
 "cdr3_seq_aa" : "ARSLTPVDY",
 "cdr3_seq_len" : 27,
 "cdr3_seq_aa_len" : 9,
 "vg" : "IGHV3-48*03",
 "dg" : "IGHD3-10*02R",
 "jg" : "IGHJ4*02",
 "donor" : 10
}

I really enjoy MongoDB's aggregation framework, but I'm having trouble with this grouping pipeline, since I can't yet $out the results to another collection. I can run this multi-group pipeline:

db.collection.aggregate([
   {$match: {cdr3_seq_aa_len: {$gt: 3}}},
   {$group: {_id: "$cdr3_seq_aa", other_set: {$addToSet: "$cdr3_seq_aa_len"}}},
   {$group: {_id: "$other_set", sum: {$sum: 1}}}
])

This gives me the number of unique cdr3_seq_aa values, grouped by length:

{ id:40, sum:1002031,
  id:41, sum:1949402,....

However, the first operation I would like to do is group by donor, so that I first know how many unique cdr3_seq_aa strings there are within each donor. Then I would like to group by length and count how many strings fall under each length.

1 Answer

If I understand the question correctly, this is what you're looking for. The key concept is that you can construct compound _id's from multiple fields.

db.collection.aggregate(
[
    {$match: {cdr3_seq_aa_len: {$gt: 3}}},
    {$group: 
         {
              _id: {donor: "$donor", cdr3_seq_aa: "$cdr3_seq_aa"},
              donor_cdr3_seq_aa_count: {$sum: 1},
              cdr3_seq_aa_len: {$first: "$cdr3_seq_aa_len"}
         }
    },
    {$group:
         {
             _id: {donor: "$_id.donor", len: "$cdr3_seq_aa_len"},
             num_strings_with_this_length: {$sum: 1},
             total_doc_count_by_length:
                  {$sum: "$donor_cdr3_seq_aa_count"}
         }
    }
])
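
To make the effect of the compound _id easier to follow, here is a rough sketch in plain JavaScript that mimics the two $group stages on an in-memory array. The sample documents and the helper names (groupBy, countByDonorAndLength) are made up for illustration; this is not how MongoDB runs the pipeline internally, only an approximation of what each stage computes.

```javascript
// Hypothetical sample data; field names follow the question's document.
const docs = [
  { donor: 10, cdr3_seq_aa: "ARSLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 10, cdr3_seq_aa: "ARSLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 10, cdr3_seq_aa: "ARGGTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 11, cdr3_seq_aa: "ARSLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 11, cdr3_seq_aa: "AR", cdr3_seq_aa_len: 2 },
];

// Hypothetical helper: groups rows by a (possibly compound) key,
// loosely imitating a $group stage whose _id is an object.
function groupBy(rows, keyOf, init, step) {
  const groups = new Map();
  for (const row of rows) {
    const id = keyOf(row);
    const k = JSON.stringify(id); // serialized compound key
    if (!groups.has(k)) groups.set(k, { _id: id, ...init(row) });
    step(groups.get(k), row);
  }
  return [...groups.values()];
}

function countByDonorAndLength(rows) {
  // Stage 1: $match on cdr3_seq_aa_len > 3
  const matched = rows.filter((d) => d.cdr3_seq_aa_len > 3);

  // Stage 2: one group per (donor, cdr3_seq_aa) pair;
  // the length is taken from the first row seen, like $first.
  const perString = groupBy(
    matched,
    (d) => ({ donor: d.donor, cdr3_seq_aa: d.cdr3_seq_aa }),
    (d) => ({ donor_cdr3_seq_aa_count: 0, cdr3_seq_aa_len: d.cdr3_seq_aa_len }),
    (g) => { g.donor_cdr3_seq_aa_count += 1; }
  );

  // Stage 3: one group per (donor, length) pair.
  return groupBy(
    perString,
    (g) => ({ donor: g._id.donor, len: g.cdr3_seq_aa_len }),
    () => ({ num_strings_with_this_length: 0, total_doc_count_by_length: 0 }),
    (out, g) => {
      out.num_strings_with_this_length += 1;
      out.total_doc_count_by_length += g.donor_cdr3_seq_aa_count;
    }
  );
}

const result = countByDonorAndLength(docs);
console.log(JSON.stringify(result, null, 2));
```

With this sample, donor 10 ends up with two distinct strings of length 9 spread over three documents, and donor 11 with one string of length 9 from one document; the length-2 string is dropped by the match step.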
3 Comments

Some issues: 1. Need another } at end of {$match 2. Need group operation for cdr3_seq_aa_len in first $group ($first works) 3. Need $cdr3_seq_aa_len in last $group _id, instead of $cdr_seq_aa_len
No problem. You beat me to the answer so I just ran your query through my test data, which made it easy to debug. :)
Thank you. The only change I needed was to stop grouping by donor, since I wanted the strings grouped by their length and then counted as unique strings. So the workflow is: first grab all unique strings among donors, then combine the donors and ask for the total unique strings grouped by length. I simply took out the donor _id grouping and it worked perfectly!
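
One possible reading of the last comment, sketched as a pipeline array: the first $group is kept as in the answer, and donor is dropped from the _id of the second $group so that results are keyed by length alone. This is an assumption about what the commenter changed, not their confirmed query; the variable name pipeline is made up for illustration.

```javascript
// Assumed variant: same as the answer, but the final $group is keyed by length only.
const pipeline = [
  { $match: { cdr3_seq_aa_len: { $gt: 3 } } },
  {
    $group: {
      // one group per (donor, string) pair, as in the answer
      _id: { donor: "$donor", cdr3_seq_aa: "$cdr3_seq_aa" },
      donor_cdr3_seq_aa_count: { $sum: 1 },
      cdr3_seq_aa_len: { $first: "$cdr3_seq_aa_len" },
    },
  },
  {
    $group: {
      // donor removed from the key: donors are combined per length
      _id: "$cdr3_seq_aa_len",
      num_strings_with_this_length: { $sum: 1 },
      total_doc_count_by_length: { $sum: "$donor_cdr3_seq_aa_count" },
    },
  },
];

// In the mongo shell this would be run as: db.collection.aggregate(pipeline)
```

Note that under this reading a string that occurs in two donors is counted once per donor. To count it only once overall, donor would also have to be removed from the _id of the first $group.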