Group by term and get list of array keys and their count

Question

In Elasticsearch I have several hundred thousand documents with roughly this structure:

{
  "script": "/index.html",
  "query": {
    "ab": "hello",
    "cd": "world",
    "ef": "123"
  }
}

The URL "http://localhost/index.html?ab=hello&cd=world&ef=123" is parsed into this document. "script" contains only the path to the target script, without the query string. The "query" object does not have the same set of keys in every document, and the values differ as well, but the values do not matter for now.

I know I can get a distinct list of "script" values with:

{
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "script.raw"
      }
    }
  }
}

which results in multiple buckets like:

"buckets": [
{
    "key": "/index.html",
    "doc_count": 123456
},
{
    "key": "/hello.html",
    "doc_count": 1456
},
...

My question: is there a way to additionally get a list and count of all query keys that occur in the different URLs?

Something like:

"buckets": [
{
    "key": "/index.html",
    "doc_count": 123456,
    "query_key_count": {
      "ab": 33456,
      "cd": 3456,
      "ef": 456,
      "gh": 56,
      "ij": 6
    }
},
{
    "key": "/hello.html",
    "doc_count": 1456,
    "query_key_count": {
      "zy": 156,
      "gh": 6
    }
},
...

Thanks a lot!

You mean that query_key_count should contain the number of occurrences of each key across all items in your data? Say you have 10 objects in total and 2 of them have "ab" in their query object; then you want the result to be query_key_count: {"ab": 2, ...} and so on?
– Syed Mauze Rehan, Commented Mar 23, 2015 at 14:44
This should help you: stackoverflow.com/questions/26743204/…
– Syed Mauze Rehan, Commented Mar 23, 2015 at 14:57
Yes. If I have an index.html doc with the params "ab" and "cd" and another index.html doc with the params "cd" and "ef", each with random values, I would like to get "query_key_count": {"cd": 2, "ab": 1, "ef": 1}. Thanks a lot for the link, I will have a look!
– Taris, Commented Mar 23, 2015 at 15:08

Sloan Ahrens · Accepted Answer · 2015-03-23 19:02:48Z

To leverage Elasticsearch's strengths, you really need your documents to be structured something like this:

{
   "script": "/index.html",
   "query": [
      {
         "query_key": "ab",
         "query_val": "hello"
      },
      {
         "query_key": "cd",
         "query_val": "world"
      },
      {
         "query_key": "ef",
         "query_val": "123"
      }
   ]
}
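
Existing documents would need to be reshaped before reindexing. The following is a minimal sketch of that conversion in Python, assuming each source document is available as a plain dict in the layout shown in the question; the function name `to_nested_doc` is hypothetical and not part of any Elasticsearch client.

```python
def to_nested_doc(doc):
    """Turn {"script": ..., "query": {key: value}} into the key/value list layout."""
    # A document without a query string may have no "query" field at all.
    query = doc.get("query") or {}
    return {
        "script": doc["script"],
        "query": [
            {"query_key": key, "query_val": value}
            for key, value in query.items()
        ],
    }


# The example document from the question.
original = {
    "script": "/index.html",
    "query": {"ab": "hello", "cd": "world", "ef": "123"},
}
converted = to_nested_doc(original)
```

Each query parameter becomes its own object, so it can later be indexed as a separate nested document and counted individually.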

If I set up a mapping with a nested type:

PUT /test_index
{
   "mappings": {
      "doc": {
         "properties": {
            "query": {
               "type": "nested",
               "properties": {
                  "query_key": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "query_val": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            },
            "script": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}

and add a couple of docs:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"script": "/index.html","query": [{"query_key":"ab", "query_val":"hello"},{"query_key":"cd", "query_val":"world"}, {"query_key":"ef", "query_val":"123"}]}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"script": "/index.html","query": [{"query_key":"ab", "query_val":"foo"},{"query_key":"cd", "query_val":"bar"}, {"query_key":"gh", "query_val":"456"}]}

I can get back query keys in a nested terms aggregation:

POST /test_index/_search?search_type=count
{
   "aggs": {
      "resellers": {
         "nested": {
            "path": "query"
         },
         "aggs": {
            "query_keys": {
               "terms": {
                  "field": "query.query_key"
               }
            }
         }
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "resellers": {
         "doc_count": 6,
         "query_keys": {
            "buckets": [
               {
                  "key": "ab",
                  "doc_count": 2
               },
               {
                  "key": "cd",
                  "doc_count": 2
               },
               {
                  "key": "ef",
                  "doc_count": 1
               },
               {
                  "key": "gh",
                  "doc_count": 1
               }
            ]
         }
      }
   }
}

Here's the code I used:

http://sense.qbox.io/gist/aecd92e5903f644e28c802860a90a86bdd7f97ee

That did it, thanks a million! Compared to what I asked for, your request is missing the first grouping by the script itself. I changed your request to:

{
   "aggs": {
      "group_by_script": {
         "terms": {
            "field": "script"
         },
         "aggs": {
            "query_count": {
               "nested": {
                  "path": "query"
               },
               "aggs": {
                  "query_keys": {
                     "terms": {
                        "field": "query.query_key"
                     }
                  }
               }
            }
         }
      }
   }
}

Now it works perfectly :)
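
The response to the combined request nests the key buckets under each script bucket. As a rough sketch, it could be flattened client-side into the "query_key_count" shape from the question. The function name `summarize` is hypothetical, and `sample_response` is hand-written to match what the two test documents above would be expected to produce, not captured from a real cluster.

```python
def summarize(response, outer="group_by_script", nested="query_count", inner="query_keys"):
    """Flatten script -> nested -> key buckets into one entry per script."""
    result = []
    for bucket in response["aggregations"][outer]["buckets"]:
        # Map each query key to the number of nested entries that carry it.
        key_counts = {
            b["key"]: b["doc_count"]
            for b in bucket[nested][inner]["buckets"]
        }
        result.append({
            "key": bucket["key"],
            "doc_count": bucket["doc_count"],
            "query_key_count": key_counts,
        })
    return result


# Hand-written response fragment (an assumption, not real cluster output).
sample_response = {
    "aggregations": {
        "group_by_script": {
            "buckets": [
                {
                    "key": "/index.html",
                    "doc_count": 2,
                    "query_count": {
                        "doc_count": 6,
                        "query_keys": {
                            "buckets": [
                                {"key": "ab", "doc_count": 2},
                                {"key": "cd", "doc_count": 2},
                                {"key": "ef", "doc_count": 1},
                                {"key": "gh", "doc_count": 1},
                            ]
                        },
                    },
                }
            ]
        }
    }
}
summary = summarize(sample_response)
```

Note that a terms aggregation returns only the top buckets by default, so scripts or keys beyond that limit would be missing unless "size" is raised in the request.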
