
I have serious trouble finding anything useful in the MongoDB documentation about dealing with embedded documents. Let's say I have the following schema:

{
  _id: ObjectId,
  ...
  data: [
    {
      _childId: ObjectId, // let's use a custom name so we can distinguish them
      ...
    }
  ] 
}
  1. What's the most efficient way to remove everything inside data for a particular _id?

  2. What's the most efficient way to remove an embedded document with a particular _childId inside a given _id? What's the performance here, and can _childId be indexed in order to achieve logarithmic (or similar) complexity instead of a linear lookup? If so, how?

  3. What's the most efficient way to insert a lot of (let's say 1000) documents into data for a given _id? And as above, can we get O(n log n) or similar complexity with proper indexing?

  4. What's the most efficient way to get the count of documents inside data for a given _id?

    It works best if you only ask one question at a time. Commented Aug 23, 2014 at 16:54

3 Answers


The other two answers give sensible advice on your questions 1-4, but I want to address your question by interrogating the basis for asking it in the first place. The terminology of "embedded document" in the context of MongoDB storing "documents" confuses people. You should not think of an embedded document as another document in MongoDB that you search for, index, or update as its own document, because that's not what it is. It's a grouped collection of fields inside a document; it's a BSON field of type Object. To quote the embedded document docs,

Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.

Starting from knowledge about your use case, you should pick your documents and document structure to make your common operations easier. If you are so concerned about 1-4, you probably want to unwind your data array of childIds into separate documents. A concrete example of this common "antipattern" is a blog with many authors - you could have a user document with a large, changing array of posts embedded inside, or a post document with user information replicated in each. I can't say for sure what is or isn't wrong with your data model as you've given no specific details about it, but struggling to understand why 1-4 seem hard or undocumented or slow in MongoDB is a good sign that you should rethink the data model so the equivalents of 1-4 are fun and easy! Or at least easier and more fun.
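
A rough sketch of that "separate documents" alternative, assuming a children collection with a parentId reference (the collection name, parentId, childId and the index are all made up for illustration); each of your points then becomes a plain indexed query:

    db.children.insert({ _id: childId, parentId: parentId /* , ... */ })
    db.children.ensureIndex({ parentId: 1 })

    db.children.remove({ parentId: parentId })         // 1. remove all children of one parent
    db.children.remove({ _id: childId })               // 2. remove one child by its id
    db.children.find({ parentId: parentId }).count()   // 4. count the children of one parent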


I can't find anything on speed, so I will go with the approaches found in the documentation, in the hope that the documented way is also the most efficient one:

  1. If you want to remove all subdocuments in data for a given _id, you can simply set data to [].
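
    A minimal sketch of that, with yourId standing in for the parent document's _id:

    db.collection.update({ _id: yourId }, { $set: { data: [] } })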

  2. The official way to remove a subdocument with a specific _childId from data would be $pull:

    db.collection.update(
        { _id: _id },
        { $pull: { data: { _childId: id } } }
    )
    

    If you don't know the parent _id, you can drop the _id filter; in that case add { multi: true } if the same _childId can occur in more than one parent document.

    On indexing subdocuments I would refer you to this question. Short answer: yes, you can index fields in subdocuments for faster lookups, just like you would index normal fields:

    db.collection.ensureIndex({"data._childId" : 1})
    

    If you want to search for a subdocument within one specific document, you can use aggregation, e.g.

    db.collection.aggregate([
        { $match: { _id: _id } },
        { $unwind: '$data' },
        { $match: { 'data._childId': _childId } }
    ])
    

    which will first match on _id and only then on _childId. It will return the parent document with data containing only the subdocument(s) matching _childId.

  3. There is $push for that, although for 1000 subdocuments you might not want to do it in one query anyway.
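
    A sketch of what a single-statement version could look like, assuming newSubdocs is the array of subdocuments to append and yourId is the parent's _id (both are placeholders):

    db.collection.update(
        { _id: yourId },
        { $push: { data: { $each: newSubdocs } } }
    )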

4 Comments

In 1) I've asked about _id not _childId. As for 2) the index you mentioned is on collection level, not document level so a find() query like { _id: 123, data._childId: 120 } will also scan entries with _id other than 123 (although it won't return them). Thanks anyway.
@Sebastian Edited accordingly
Thanks, but the .aggregate method either won't use the index or will still have to scan non-matching _id.
That is right, but if you often query for single specific subdocuments, using references might be a better strategy than embedding them. You could then query for the (sub)documents by their index (maybe a compound index of _id and _parentId) and build the parent documents with aggregation when you need them.

  1. Trudbert is right: db.collection.update({_id:yourId},{$set:{data:[]}})
  2. Two points for Trudbert. However, I would like to add that if you have the whole document available in your app, it might be reasonable to simply replace the contents of the whole document if suitable for your use case.
  3. I have had good experience with bulk updates performance-wise. You might want to try them (see the sketch after this list).
  4. I don't know how you come to the idea that an aggregate wouldn't use indices, but since _id is unique, it would make much more sense to use db.collection.findOne({_id:yourId},{"data._childId":1,_id:0}).data.length or use its equivalent as a raw command in the driver of choice. Since the connection is already established, unless the array is very big, it should be faster to simply return the data instead of having the calculations done on a possibly (over)loaded server.
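
Hedged sketches for points 3 and 4, using the mongo shell's bulk API (available since 2.6) and the $size aggregation operator; newSubdocs, yourId and the chunk size of 100 are illustrative placeholders:

    // Point 3: append many subdocuments in chunks via an unordered bulk operation
    var bulk = db.collection.initializeUnorderedBulkOp();
    for (var i = 0; i < newSubdocs.length; i += 100) {
        bulk.find({ _id: yourId }).updateOne({
            $push: { data: { $each: newSubdocs.slice(i, i + 100) } }
        });
    }
    bulk.execute();

    // Point 4: have the server count the array instead of returning it
    db.collection.aggregate([
        { $match: { _id: yourId } },
        { $project: { _id: 0, count: { $size: "$data" } } }
    ])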

As per your comments to Trudbert's answer: _id is unique, so exactly one doc will need to be modified for a known _id: db.collection.update({_id:theId},{$pull..... It does not get more efficient. For an unknown _id, create an index on _childId and do the same pull operation with a match on _childId instead of _id, with the multi option set, to remove all references to a specific _childId.
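
Spelled out, the "unknown _id" variant described above might look like this (childId is a placeholder for the value to purge from every parent document):

    db.collection.ensureIndex({ "data._childId": 1 })

    db.collection.update(
        { "data._childId": childId },
        { $pull: { data: { _childId: childId } } },
        { multi: true }
    )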

I strongly second Trudbert's suggestion of using the aggregation framework to create documents when needed out of optimized data. Currently, I have an aggregation pipeline which analyses 5M records with more than 7 million relations to each other in some 6 seconds. On a non-sharded standalone instance. With spinning disks, crappy IO and no optimization at all. By carefully planning the aggregations (an early match limiting the documents passed on to those not processed so far) and merging them with earlier results (adapting the _id in the group phase can achieve that), you can optimize this down to mere fractions of a second, if absolutely necessary.
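
Purely to illustrate the "early $match" idea (the processed flag and childCount are invented for this example; a real pipeline would depend on your data):

    db.collection.aggregate([
        { $match: { processed: { $ne: true } } },   // early $match: only unprocessed documents enter the pipeline
        { $unwind: "$data" },
        { $group: { _id: "$_id", childCount: { $sum: 1 } } }
    ])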

6 Comments

What is your suggested strategy for "use its equivalent as a raw command in the driver of choice" if the array is very big?
As I said: unless it is very big. I just tested it with some 400k entries for the array, pretty close to the maximum BSON size of 16MB ({b:[{c:1,text:"foo"},...,{c:400000,text:"foo"}]}). The command takes 2.5 seconds on average. In this case, using an aggregation pipeline is faster. Actually, one has to find out which command shows better performance depending on the use case. Personally, I'd say that creating a statistics table using an aggregation pipeline is the way I'd take.
How do you implement that command, by calling db.eval()? It's beyond my understanding that mongodb doesn't store the size of an array at the head of the array's BSON structure. If it did, it would be very easy to get the size of any array. Do you know why?
I don't know what language you are using, but it is documented in the respective driver's API docs. As for the size in the array header: mongod would have to calculate the size of the array with every write operation done to the array. As the array elements may vastly differ in size, mongod would basically have to parse the complete array every time it writes to it, which would result in quite inefficient writes. What it can do now is to look up \x04FieldName within the document, find the next nullbyte, and prepend the data, either within the padding or by doing a relocation. Pretty efficient.
The language I'm using is Java. Thanks a lot for the detailed explanations. Perhaps I used an improper word, "size". I meant the count of elements in the array, as in question 4). As we know, a BSON array is saved in this form: <byteSize><\x04><fieldName><\x00>[elements], and <byteSize> has to be modified if the array changes. Then why not change the format to <byteSize><\x04><elementCount><fieldName><\x00>[elements]? The count of elements in the array could then be obtained immediately without extra calculation, even for the 400K entries you tried. Any thoughts?