
I'm running several aggregations to SUM some values on our installation of ES 1.7.2.

I found out the hard way that, in some seemingly random situations, the doc_count of a bucket doesn't match the SUM of the doc_count values of its nested buckets.

"key": 503,
"doc_count": 383778,
"regionid": {...}

So doc_count=383778

If I SUM the doc_count of every element of the regionid list below, I get doc_count=383718, i.e. 60 documents fewer than the parent bucket reports.

 "key": 503,
 "doc_count": 383778,
 "regionid": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
       {
          "key": 1,
          "doc_count": 303821,
          "ProviderId": {...}
       },
       {
          "key": 27,
          "doc_count": 23834,
          "ProviderId": {...}
       },
       {
          "key": 25,
          "doc_count": 9565,
          "ProviderId": {...}
       },
       {
          "key": 36,
          "doc_count": 8857,
          "ProviderId": {...}
       },
       {
          "key": 14,
          "doc_count": 8222,
          "ProviderId": {...}
       },
       {
          "key": 68,
          "doc_count": 6746,
          "ProviderId": {...}
       },
       {
          "key": 19,
          "doc_count": 4574,
          "ProviderId": {...}
       },
       {
          "key": 28,
          "doc_count": 4164,
          "ProviderId": {...}
       },
       {
          "key": 10,
          "doc_count": 3006,
          "ProviderId": {...}
       },
       {
          "key": 31,
          "doc_count": 2020,
          "ProviderId": {...}
       },
       {
          "key": 21,
          "doc_count": 1410,
          "ProviderId": {...}
       },
       {
          "key": 32,
          "doc_count": 1368,
          "ProviderId": {...}
       },
       {
          "key": 22,
          "doc_count": 1367,
          "ProviderId": {...}
       },
       {
          "key": 8,
          "doc_count": 1010,
          "ProviderId": {...}
       },
       {
          "key": 16,
          "doc_count": 825,
          "ProviderId": {...}
       },
       {
          "key": 35,
          "doc_count": 559,
          "ProviderId": {...}
       },
       {
          "key": 34,
          "doc_count": 517,
          "ProviderId": {...}
       },
       {
          "key": 26,
          "doc_count": 414,
          "ProviderId": {...}
       },
       {
          "key": 18,
          "doc_count": 371,
          "ProviderId": {...}
       },
       {
          "key": 15,
          "doc_count": 362,
          "ProviderId": {...}
       },
       {
          "key": 33,
          "doc_count": 185,
          "ProviderId": {...}
       },
       {
          "key": 9,
          "doc_count": 143,
          "ProviderId": {...}
       },
       {
          "key": 29,
          "doc_count": 102,
          "ProviderId": {...}
       },
       {
          "key": 17,
          "doc_count": 100,
          "ProviderId": {...}
       },
       {
          "key": 30,
          "doc_count": 96,
          "ProviderId": {...}
       },
       {
          "key": 20,
          "doc_count": 80,
          "ProviderId": {...}
       }
    ]
 }
},

Does anyone know why this is happening?

Maybe it's a bug?

Part of my aggregation:

 {
    "aggs": {
       "Provider": {
          "terms": {
             "field": "Provider"
          },
          "aggs": {
             "Gateway": {
                "terms": {
                   "field": "Gateway"
                },
                "aggs": {
                   "CustomerId": {
                      "terms": {
                         "field": "CustomerId"
                      },
                      "aggs": {
                         "regionid": {
                            "terms": {
                               "field": "regionid"

Any help is appreciated. Thanks!

2 Comments

  • Is it possible that 60 of your documents don't have a value for the provider field? Commented Feb 26, 2016 at 4:06
  • Actually this was the problem: a "long" field had an empty value (a quick way to verify it is sketched below). Thanks! Commented Feb 26, 2016 at 19:32
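
For anyone hitting the same symptom: documents that have no value for a field simply don't land in any bucket of a terms aggregation on that field, so a parent bucket's doc_count can legitimately exceed the sum of its children. A quick way to check this on ES 1.x is to count the documents missing the field. This is only a sketch: it assumes an index named myindex and uses the regionid field from the question, and the missing filter shown here is the 1.x construct (later versions use must_not + exists instead):

 # "myindex" is a placeholder index name; adjust host/port as needed
 curl -XGET 'localhost:9200/myindex/_count' -d '{
    "query": {
       "constant_score": {
          "filter": {
             "missing": { "field": "regionid" }
          }
       }
    }
 }'

To reproduce the exact difference of 60, you would additionally filter on the keys of the enclosing Provider/Gateway/CustomerId buckets.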

1 Answer


Aggregations in ES are not exact; they are an estimate based on the number of records sampled on each shard. Given a big enough sample size, the counts can be exact, but that has significant performance implications.

You can read more in the ES documentation on the shard_size parameter of the terms aggregation.

The flatter your index (meaning the more buckets the aggregation returns), the more you need to increase the shard size. We found that for a flat index in our system a 20x multiplier was a good rule of thumb: if we return the top 10 terms of an aggregation, we use a shard_size of 200.
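
As a minimal sketch, here is that rule of thumb applied to the Provider aggregation from the question; the size of 10 and shard_size of 200 are just the illustrative top-10 numbers from the paragraph above, not tuned values:

 {
    "aggs": {
       "Provider": {
          "terms": {
             "field": "Provider",
             "size": 10,
             "shard_size": 200
          }
       }
    }
 }

Each shard then returns its top 200 terms instead of its top 10, which makes the merged doc_count values far less likely to be off.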


1 Comment

  • Awesome. I'll take a look.
