
In short: I want to look up the distinct values of a document field, BUT only the values that match some filter. The problem is with array fields. Imagine the following documents in ES 2.4:

[
  {
    "states": [
      "Washington (US-WA)",
      "California (US-CA)"
    ]
  },
  {
    "states": [
      "Washington (US-WA)"
    ]
  }
]

I'd like my users to be able to look up all possible states via typeahead, so I have the following query for the user input "wa":

{
  "query": {
    "wildcard": {
      "states.raw": "*wa*"
    }
  },
  "aggregations": {
    "typed": {
      "terms": {
        "field": "states.raw"
      },
      "aggregations": {
        "typed_hits": {
          "top_hits": {
            "_source": { "includes": ["states"] }
          }
        }
      }
    }
  }
}

states.raw is a sub-field mapped with the not_analyzed option.

This query works pretty well unless a document holds an array of values, as in the example: then it returns both Washington and California. I do understand why this happens (the query and the aggregations operate on whole documents, and the first document contains both values even though only one of them matched the filter), but I really want to see only Washington, and I don't want to add another layer of filtering on the application side for the ES results.
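
To make the behaviour described above concrete, here is a minimal Python sketch that mimics it outside Elasticsearch. It is only an illustration: the function name terms_buckets is hypothetical, fnmatchcase stands in for the wildcard query, and the pattern is written as "*Wa*" because this stand-in comparison is case-sensitive.

```python
from collections import Counter
from fnmatch import fnmatchcase

# The two documents from the question.
docs = [
    {"states": ["Washington (US-WA)", "California (US-CA)"]},
    {"states": ["Washington (US-WA)"]},
]


def terms_buckets(documents, pattern):
    """Bucket every `states` value of each document that has at least one match."""
    buckets = Counter()
    for doc in documents:
        if any(fnmatchcase(value, pattern) for value in doc["states"]):
            # The document as a whole matched, so ALL of its values are
            # bucketed, each counted once per document.
            buckets.update(set(doc["states"]))
    return dict(buckets)


buckets = terms_buckets(docs, "*Wa*")
print(buckets)
```

California ends up in the buckets only because it shares a document with Washington, which is exactly the unwanted result.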

Is there a way to do this with a single ES 2.4 request?

2 Answers


You could use the "Filtering Values" feature of the terms aggregation (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2). The "include" parameter takes a regular expression, so you need to escape the "wa" string carefully before embedding it in the pattern. Your request could look like:

POST /index/collection/_search?size=0
{
  "aggregations": {
    "typed": {
      "terms": {
        "field": "states.raw",
        "include": ".*wa.*"
      },
      "aggregations": {
        "typed_hits": {
          "top_hits": {
            "_source": { "includes": ["states"] }
          }
        }
      }
    }
  }
}
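
As a rough sketch of the escaping step, the Python snippet below builds the "include" pattern from raw user input. The helper names escape_lucene_regex and build_typeahead_body are hypothetical, and the set of reserved characters is an assumption based on the Lucene regular expression syntax (including its optional operators), so verify it against the documentation for your version.

```python
import json

# Characters assumed to be operators in Lucene regular expressions,
# including the optional operators # @ & < > ~.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape every reserved character so it is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def build_typeahead_body(user_input, field="states.raw"):
    """Build a terms aggregation that keeps only buckets containing user_input."""
    pattern = ".*" + escape_lucene_regex(user_input) + ".*"
    return {
        "aggregations": {
            "typed": {
                "terms": {"field": field, "include": pattern}
            }
        }
    }


body = build_typeahead_body("(US-WA)")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the request body; without the escaping, input such as "(US-WA)" would be interpreted as a regex group instead of literal parentheses.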

1 Comment

I had been looking at include, but I didn't use it properly and missed the chance to find the answer myself. Thanks!

I can't hold myself back from telling you, though, that a wildcard query with a leading wildcard is not the best solution, because it performs poorly as the data grows. Please do consider using ngrams for this:

PUT states
{
  "settings": {
    "analysis": {
      "filter": {
        "ngrams": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "20"
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "location": {
          "properties": {
            "states": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ngrams": {
                  "type": "string",
                  "analyzer": "ngram_analyzer"
                }
              }
            }
          }
        }
      }
    }
  }
}
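
To see why a plain term query can replace the wildcard with this mapping, the Python sketch below approximates the tokens that ngram_analyzer would index for one value. It is an approximation under stated assumptions: the regex split only roughly imitates the standard tokenizer, and ngram_tokens is a hypothetical helper, not an Elasticsearch API.

```python
import re


def ngram_tokens(text, min_gram=2, max_gram=20):
    """Approximate the index-time tokens: split, lowercase, then emit n-grams."""
    tokens = set()
    # Rough stand-in for the standard tokenizer: split on non-alphanumerics.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()
        # Emit every substring whose length is between min_gram and max_gram.
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            for start in range(len(word) - size + 1):
                tokens.add(word[start:start + size])
    return tokens


tokens = ngram_tokens("Washington (US-WA)")
print(sorted(tokens))
```

Because fragments such as "sh" and "wa" are stored as tokens at index time, a lowercase fragment typed by the user can be found with an exact term lookup on the ngrams sub-field instead of a scan over all terms.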


POST states/doc/1
{
  "text":"bla1",
  "location": [
    {
      "states": [
        "Washington (US-WA)",
        "California (US-CA)"
      ]
    },
    {
      "states": [
        "Washington (US-WA)"
      ]
    }
  ]
}
POST states/doc/2
{
  "text":"bla2",
  "location": [
    {
      "states": [
        "Washington (US-WA)",
        "California (US-CA)"
      ]
    }
  ]
}
POST states/doc/3
{
  "text":"bla3",
  "location": [
    {
      "states": [
        "California (US-CA)"
      ]
    },
    {
      "states": [
        "Illinois (US-IL)"
      ]
    }
  ]
}

And the final query:

GET states/_search
{
  "query": {
    "term": {
      "location.states.ngrams": {
        "value": "sh"
      }
    }
  },
  "aggregations": {
    "filtering_states": {
      "terms": {
        "field": "location.states.raw",
        "include": ".*sh.*"
      },
      "aggs": {
        "typed_hits": {
          "top_hits": {
            "_source": {
              "includes": [
                "location.states"
              ]
            }
          }
        }
      }
    }
  }
}

3 Comments

Thanks for the ngrams suggestion, but our requirements do need it to work this way, and it's sad, I know. I can't mark your answer as the accepted one because @igelbox provided the same query a bit earlier; that would be unfair, sorry.
Hehe, no worries about the answer. I've heard before about the restriction of having to use something as is, with no option to change it. It's unfortunate, because such users make the necessary changes only when they hit a performance issue, usually after the amount of data in the cluster or the number of requests has grown. By the time this happens, the environment might already be seriously affected, and at that point a mapping change will have a bigger impact on the cluster overall.
That's totally fair, and you should know I did everything I could on my side to change that, but no luck. Maybe we'll change it when we hit the performance issues. And it'll be a good lesson for everyone :)
