
I need to compute a pipeline aggregation in ElasticSearch and I can't figure out how to express it.

Each document has an email address and an amount. I need to output range buckets that count unique emails by the total amount each email has received.

{ "0 - 99": 300, "100 - 400": 100 ...}

Would basically be the expected output (the keys would be transformed in my application code), indicating that 300 unique emails have cumulatively received at most 99 (amount) across all documents, 100 unique emails have received between 100 and 400, and so on.
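For concreteness, the documents look roughly like this (the mapping below is a sketch based on my description; the field names are real, the index name and values are illustrative, and a terms aggregation on email needs a keyword/not_analyzed field; adjust the endpoints to your Elasticsearch version):

PUT index
{
  "mappings": {
    "properties": {
      "email":  { "type": "keyword" },
      "amount": { "type": "double" }
    }
  }
}

PUT index/_doc/1
{ "email": "jane@example.com", "amount": 42.5 }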

Intuitively, I would expect a query like the one below. However, range is not a pipeline aggregation and does not appear to accept a buckets_path.

What is the correct approach here?

{
  "aggs": {
    "users": {
      "terms": {
        "field": "email"
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        }
      }
    },
    "amount_ranges": {
      "range": {
        "buckets_path": "users>amount_received",
        "ranges": [
          { "to": 99.0 },
          { "from": 100.0, "to": 299.0 },
          { "from": 300.0, "to": 599.0 },
          { "from": 600.0 }
        ]
      }
    }
  }
}

1 Answer

There's no pipeline aggregation that does that directly. However, I think I came up with a solution that should suit your needs. The idea is to repeat the same terms/sum aggregation and then use a bucket_selector pipeline aggregation for each of the ranges you're interested in.

POST index/_search
{
  "size": 0,
  "aggs": {
    "users_99": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "-99": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived < 100"
          }
        }
      }
    },
    "users_100_299": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "100-299": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 100 && params.amountReceived < 300"
          }
        }
      }
    },
    "users_300_599": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "300-599": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 300 && params.amountReceived < 600"
          }
        }
      }
    },
    "users_600": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "600": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 600"
          }
        }
      }
    }
  }
}

In the results, the number of buckets in users_99 will be the number of unique emails whose total amount is less than 100. Similarly, users_100_299 will contain as many buckets as there are unique emails with totals of at least 100 and less than 300. And so on...
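To build the { "0 - 99": 300, ... } map, your application code then just counts the buckets per range. An abridged, hypothetical response excerpt (values invented for illustration):

{
  "aggregations": {
    "users_99": {
      "buckets": [
        {
          "key": "jane@example.com",
          "doc_count": 3,
          "amount_received": { "value": 42.5 }
        }
      ]
    }
  }
}

The value for the "0 - 99" key is simply the number of entries in users_99.buckets.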

Comments

Probably the only solution that stays entirely within ES (no steps outside ES). I do have some yet-to-be-proved concerns about the performance of this bundle of aggregations. BUT if, however it performs, you are happy with it @Ben, then there are no concerns :-). If the performance is affecting your ES usage, maybe consider doing the "split" outside Elasticsearch.
Agreed @Andrei, performance might be a concern depending on the amount of data Ben wants to run this query on. We'll see what he says. Besides, it would be nice to create a new bucket_range pipeline aggregation, I'll probably file a feature request soon.
Val, I did consider a solution like this but was hoping there was a more built-in approach. I will definitely give this a try and see if the performance hit is acceptable. Thanks!
@Ben were you able to try this out?
@Val, I was able to try it out. It would be great if I could just derive the document count of each bucket, as opposed to returning an array of docs for each. As you point out, pulling all of the records (beyond the arbitrary 1k size you placed) for each bucket would likely create a performance issue. I suppose deriving that count is ultimately not possible.
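@Ben one idea for deriving just the count, sketched here for the first range only and not verified on every Elasticsearch version: add a sibling stats_bucket pipeline aggregation per range. Assuming the bucket_selector prunes the terms buckets before the sibling pipeline runs, its count should reflect only the surviving unique emails:

POST index/_search
{
  "size": 0,
  "aggs": {
    "users_99": {
      "terms": { "field": "email", "size": 1000 },
      "aggs": {
        "amount_received": { "sum": { "field": "amount" } },
        "-99": {
          "bucket_selector": {
            "buckets_path": { "amountReceived": "amount_received" },
            "script": "params.amountReceived < 100"
          }
        }
      }
    },
    "users_99_count": {
      "stats_bucket": {
        "buckets_path": "users_99>amount_received"
      }
    }
  }
}

users_99_count.count would then be the number of unique emails in the 0-99 range, without the client having to read through the terms buckets themselves.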