I have the following mapping for an aggregation field:

"language" : {
    "type" : "string",
    "index": "analyzed",
    "analyzer" : "standard"
}

The value of a sample document in this property may look like: "en zh_CN"

This property has no other use except aggregation. I notice that when I get aggregation results on this property:

{
  "query": {
        "filtered" : {
            "query": { 
                    "match_all": {}
            },
            "filter" : {
                 ...
            }
        }
    },
    "aggregations": {
        "facets": {
            "terms": {
                "field": "language"
            }
        }
    }   
}

The bucket key values are in lower case.

  "aggregations" : {
    "facets" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "zh_cn",
        "doc_count" : 2
      }, {
        "key" : "en",
        "doc_count" : 1
      } ]
    }
  }

How can I achieve my aggregation goal without letting ES lowercase the values? I suspect I need to change the mapping for this property, but I am not sure how.

Thanks and regards.

1 Answer

Try this in your mapping instead:

"language" : {
    "type" : "string",
    "index": "not_analyzed"
}

With "not_analyzed", the value of that field in each document is indexed as a single, unmodified term, and those terms are what your terms aggregation returns as bucket keys. For the example value you provided, the aggregation will return it verbatim:

"aggregations": {
   "facets": {
      "buckets": [
         {
            "key": "en zh_CN",
            "doc_count": 1
         }
      ]
   }
}

If you still want the text to be tokenized on whitespace, but without lowercasing, you can try the whitespace analyzer in your mapping. It splits on whitespace and leaves the case of each token untouched:

"language": {
   "type": "string",
   "analyzer": "whitespace"
}

Then your aggregation will return:

"aggregations": {
   "facets": {
      "buckets": [
         {
            "key": "en",
            "doc_count": 1
         },
         {
            "key": "zh_CN",
            "doc_count": 1
         }
      ]
   }
}

Here is the code I used to test both examples:

http://sense.qbox.io/gist/a7b3c7d50c7012537c50d576d03940b28b5f8793
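
To make the difference between the three mappings easier to see without a running cluster, here is a rough Python sketch of how each one turns a field value into terms and how a terms aggregation would then count them. It is only an illustration: the function names and the two sample documents are hypothetical, and standard_approx is a simplified stand-in (whitespace split plus lowercasing) for the real standard analyzer, which uses Unicode word-boundary rules.

```python
from collections import Counter


def not_analyzed(value):
    # "index": "not_analyzed" -> the whole field value is one term, unchanged.
    return [value]


def whitespace(value):
    # "analyzer": "whitespace" -> split on whitespace only, case preserved.
    return value.split()


def standard_approx(value):
    # Simplified assumption, not the real standard analyzer:
    # split on whitespace, then lowercase each token.
    return [token.lower() for token in value.split()]


def terms_buckets(docs, analyze):
    # Mimic doc_count: each distinct term counts once per document.
    counts = Counter()
    for value in docs:
        counts.update(set(analyze(value)))
    return dict(counts)


# Hypothetical sample documents (values of the "language" field).
docs = ["en zh_CN", "zh_CN"]

for name, analyze in [
    ("standard (approx.)", standard_approx),
    ("not_analyzed", not_analyzed),
    ("whitespace", whitespace),
]:
    print(name, terms_buckets(docs, analyze))
```

Only the whitespace variant keeps "zh_CN" in its original case while still producing "en" and "zh_CN" as separate bucket keys.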

2 Comments

Sloan, thanks for your input! Your mapping is not going to work for me, because the value "en zh_CN" actually contains two elements, "en" and "zh_CN", and they should be two separate bucket keys. With your mapping, I get a bucket key such as "en zh_CN" in the aggregation results.
Yeah, I just added another example to my answer that might work for you.
