
I need to compute a pipeline aggregation in ElasticSearch and I can't figure out how to express it.

Each document has an email address and an amount. I need to output range buckets that count unique emails by the total amount each email has received.

{ "0 - 99": 300, "100 - 400": 100 ...}

Would basically be the expected output (the keys would be transformed in my application code), indicating that 300 unique emails have cumulatively received at most 99 (amount) across all documents, 100 unique emails have received between 100 and 400, and so on.
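For concreteness, the documents look roughly like this (the mapping below is a sketch based on my description; the field names are real, the index name and values are illustrative, and a terms aggregation on email needs a keyword/not_analyzed field; adjust the endpoints to your Elasticsearch version):

PUT index
{
  "mappings": {
    "properties": {
      "email":  { "type": "keyword" },
      "amount": { "type": "double" }
    }
  }
}

PUT index/_doc/1
{ "email": "jane@example.com", "amount": 42.5 }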

Intuitively, I would expect a query like the one below. However, range is not a pipeline aggregation and does not appear to accept a buckets_path.

What is the correct approach here?

{
  "aggs": {
    "users": {
      "terms": {
        "field": "email"
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        }
      }
    },
    "amount_ranges": {
      "range": {
        "buckets_path": "users>amount_received",
        "ranges": [
          { "to": 99.0 },
          { "from": 100.0, "to": 299.0 },
          { "from": 300.0, "to": 599.0 },
          { "from": 600.0 }
        ]
      }
    }
  }
}

1 Answer

There's no pipeline aggregation that does that directly. However, I think I came up with a solution that should suit your needs. The idea is to repeat the same terms/sum aggregation and then use a bucket_selector pipeline aggregation for each of the ranges you're interested in.

POST index/_search
{
  "size": 0,
  "aggs": {
    "users_99": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "-99": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived < 100"
          }
        }
      }
    },
    "users_100_299": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "100-299": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 100 && params.amountReceived < 300"
          }
        }
      }
    },
    "users_300_599": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "300-599": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 300 && params.amountReceived < 600"
          }
        }
      }
    },
    "users_600": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "600": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 600"
          }
        }
      }
    }
  }
}

In the results, the number of buckets in users_99 will be the number of unique emails whose total amount is less than 100. Similarly, users_100_299 will contain as many buckets as there are unique emails with totals of at least 100 and less than 300. And so on...
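To build the { "0 - 99": 300, ... } map, your application code then just counts the buckets per range. An abridged, hypothetical response excerpt (values invented for illustration):

{
  "aggregations": {
    "users_99": {
      "buckets": [
        {
          "key": "jane@example.com",
          "doc_count": 3,
          "amount_received": { "value": 42.5 }
        }
      ]
    }
  }
}

The value for the "0 - 99" key is simply the number of entries in users_99.buckets.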

Comments

Probably the only solution that stays entirely within ES (no steps outside ES). I do have some yet-to-be-proved concerns about the performance of this bundle of aggregations. BUT if, however it performs, you are happy with it @Ben, then there are no concerns :-). If the performance is affecting your ES usage, maybe consider doing the "split" outside Elasticsearch.
Agreed @Andrei, performance might be a concern depending on the amount of data Ben wants to run this query on. We'll see what he says. Besides, it would be nice to create a new bucket_range pipeline aggregation, I'll probably file a feature request soon.
Val, I did consider a solution like this but was hoping there was a more built-in approach. I will definitely give this a try and see if the performance hit is acceptable. Thanks!
@Ben were you able to try this out?
@Val, I was able to try it out. It would be great if I could just derive the document count of each bucket, as opposed to returning an array of docs for each. As you point out, pulling all of the records (beyond the arbitrary 1k size you placed) for each bucket would likely create a performance issue. I suppose deriving that count is ultimately not possible.
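@Ben one idea for deriving just the count, sketched here for the first range only and not verified on every Elasticsearch version: add a sibling stats_bucket pipeline aggregation per range. Assuming the bucket_selector prunes the terms buckets before the sibling pipeline runs, its count should reflect only the surviving unique emails:

POST index/_search
{
  "size": 0,
  "aggs": {
    "users_99": {
      "terms": { "field": "email", "size": 1000 },
      "aggs": {
        "amount_received": { "sum": { "field": "amount" } },
        "-99": {
          "bucket_selector": {
            "buckets_path": { "amountReceived": "amount_received" },
            "script": "params.amountReceived < 100"
          }
        }
      }
    },
    "users_99_count": {
      "stats_bucket": {
        "buckets_path": "users_99>amount_received"
      }
    }
  }
}

users_99_count.count would then be the number of unique emails in the 0-99 range, without the client having to read through the terms buckets themselves.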