
I just want to find the optimal approach to aggregation, and I'm not sure how to handle indexing when aggregating. If anyone has experience with this, please share your ideas or experience...

Situation:

  • MongoDB collection with millions of records, say some logs (around 3-5 million per day)
  • Everything is implemented with Java 7 and the MongoDB aggregation framework
  • Log record in Mongo collection looks like this:
     {
          "_id": "",
          "timestamp": "",
          "userId": "",
          "userIp": "",
          "country": "",
          "city": "",
          "applicationName": ""
     }
  • I have different reports based on the log data. I need to create reports by almost every field and by field combinations; moreover, all aggregation should be done Daily/Weekly/Monthly

Question: How should I work with indexing? And what, in your opinion, is the best way to create reports from such data?

    Add indexes based on the matching/filtering that you want to do. Make sure datatypes like timestamp align nicely with MongoDB aggregation features you want to use (like it will need to be a Date type). The question you asked is too broad and I don't understand what you're asking about "creating reports". Commented Apr 27, 2014 at 18:50
  • @WiredPrairie for example: if I have two filters 1) timestamp + userId 2) timestamp + applicationName + country, or search criteria based on these 4 fields, do I need 4 indexes? And what is the best way to index fields? One way is to use '.ensureIndex()' and another is to use the Spring annotation '@CompoundIndexes'. Are there any principal differences between 'ensureIndex()' and '@CompoundIndexes'? Commented Apr 28, 2014 at 7:24
  • The annotation invokes the actual ensureIndex() command upon connection. I would actually hope there is a way of tuning that so this is not necessarily done on every application startup (which would be costly), but it is intended as a way of defining "schema like" operations in your deployed code. Look for a "tuning" optimization for this in the documentation, or otherwise "script" any changes in indexing with your deployment. Commented Apr 28, 2014 at 9:19

1 Answer


So for index deployment to optimize you want the following indexes created, or otherwise specified with the equivalent @CompoundIndexes annotation on your class:

db.collection.ensureIndex({ 
    "timestamp": 1, "userId": 1
})

db.collection.ensureIndex({
    "timestamp": 1, "applicationName": 1, "country": 1
})
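For the annotation form mentioned above, the same two indexes might be declared on the mapped class roughly like this (a sketch only; the class name, index names, and field list are assumptions based on the record shown in the question):

```java
// Sketch, assuming Spring Data MongoDB: the two compound indexes from
// ensureIndex() above, declared on the mapped document class instead.
import java.util.Date;

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.index.CompoundIndex;
import org.springframework.data.mongodb.core.index.CompoundIndexes;
import org.springframework.data.mongodb.core.mapping.Document;

@Document(collection = "logs")
@CompoundIndexes({
    @CompoundIndex(name = "ts_user_idx",
                   def = "{ 'timestamp': 1, 'userId': 1 }"),
    @CompoundIndex(name = "ts_app_country_idx",
                   def = "{ 'timestamp': 1, 'applicationName': 1, 'country': 1 }")
})
public class LogEntry {
    @Id
    private String id;
    private Date timestamp;
    private String userId;
    private String userIp;
    private String country;
    private String city;
    private String applicationName;
    // getters/setters omitted
}
```

As discussed in the comments, Spring Data issues the ensureIndex() calls for these declarations when it connects, so the end result is the same as running the shell commands.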

That comes from your comments about intended usage, so 2 indexes are required in total.

Also note that you want your "timestamp" values to be BSON Dates, so that you can use the date aggregation operators that are important to your actual queries. The shell JavaScript form is used here for general reference:

db.collection.aggregate([
    // Using the index that was created
    { "$match": {
        "timestamp": { 
           "$gte": new Date("2014-04-01"), "$lt": new Date("2014-05-01")
        },
        "userId": { "$gte": "lowervalue", "$lte": "uppervalue" }
    }},

    // Grouping Data
    { "$group": {
        "_id": {
            "y": { "$year": "$timestamp" },
            "m": { "$month": "$timestamp" },
            "d": { "$dayOfMonth": "$timestamp" }
        },
        "someField": { "$sum": "$someField" },
        "otherField": { "$avg": "$otherField" }
    }}
])

So it is the "date aggregation operators" that allow you to split that BSON date into the components that you want (in this case day) so that all the timestamp values contained within those boundaries are subject to the other aggregation operations on the other fields that you have.
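On the Java 7 side, the component split that $year, $month and $dayOfMonth perform server-side can be mimicked with java.util.Calendar when you need the same keys client-side (a minimal sketch; the class name DateParts is just for illustration):

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class DateParts {

    // Split a Date into { year, month, dayOfMonth }, mirroring what the
    // $year / $month / $dayOfMonth aggregation operators do server-side.
    // MongoDB evaluates these in UTC, so use UTC here as well.
    public static int[] split(Date timestamp) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTime(timestamp);
        return new int[] {
            cal.get(Calendar.YEAR),
            cal.get(Calendar.MONTH) + 1,      // Calendar months are zero-based
            cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        // 2014-04-27T12:00:00Z as milliseconds since the epoch
        int[] parts = split(new Date(1398600000000L));
        System.out.println(parts[0] + "-" + parts[1] + "-" + parts[2]);
    }
}
```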

Please note that indexes can only be used in the initial $match stage of the aggregation pipeline, so that is where you select your data and reduce your working set. If you do things this way, you will get the maximum performance possible from your data.

For further gains, consider "pre-aggregating" information in other collections, based on periodically running the base forms of aggregation over the raw "log" data that you have.
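To sketch that pre-aggregation idea in plain Java (an in-memory illustration only; the LogRecord shape and the composite key format are assumptions, and a real job would run a $group over the raw collection and write the result to a reporting collection):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DailyRollup {

    // Hypothetical minimal log record: only the fields this rollup needs.
    public static class LogRecord {
        public final String day;             // e.g. "2014-04-27", derived from the timestamp
        public final String applicationName;
        public LogRecord(String day, String applicationName) {
            this.day = day;
            this.applicationName = applicationName;
        }
    }

    // Count records per (day, applicationName) pair. A periodic job would
    // persist this map into a pre-aggregated reporting collection so that
    // Daily/Weekly/Monthly reports read the small rollup, not the raw logs.
    public static Map<String, Integer> rollup(List<LogRecord> records) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (LogRecord r : records) {
            String key = r.day + "|" + r.applicationName;
            Integer current = counts.get(key);
            counts.put(key, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LogRecord> records = new ArrayList<LogRecord>();
        records.add(new LogRecord("2014-04-27", "appA"));
        records.add(new LogRecord("2014-04-27", "appA"));
        records.add(new LogRecord("2014-04-27", "appB"));
        System.out.println(rollup(records));
    }
}
```

Weekly and monthly reports can then be built by summing these daily documents instead of re-scanning millions of raw log records.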


1 Comment

Good. That's a good hint and a place to explore this topic further.
