Distinct values from array-field matching filter in Elasticsearch 2.4

Question

In short: I want to look up the distinct values of a document field, BUT only the values that match some filter. The problem is with array fields. Imagine the following documents in ES 2.4:

[
  {
    "states": [
      "Washington (US-WA)",
      "California (US-CA)"
    ]
  },
  {
    "states": [
      "Washington (US-WA)"
    ]
  }
]

I'd like my users to be able to look up all possible states via typeahead, so I have the following query for the user input "wa":

{
  "query": {
    "wildcard": {
      "states.raw": "*wa*"
    }
  },
  "aggregations": {
    "typed": {
      "terms": {
        "field": "states.raw"
      },
      "aggregations": {
        "typed_hits": {
          "top_hits": {
            "_source": { "includes": ["states"] }
          }
        }
      }
    }
  }
}

states.raw is a sub-field with the not_analyzed option.

This query works pretty well unless I have an array of values like in the example - it returns both Washington and California. I do understand why it happens (query and aggregations are working on top of the document and the document contains both, even though only one option matched the filter), but I really want to only see Washington and don't want to add another layer of filtering on the application side for the ES results.
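
To make the behavior described above concrete, here is a small Python sketch (plain Python, not Elasticsearch code) that mimics both approaches on the two sample documents. The helper names `matches`, `terms_after_query` and `terms_with_value_filter` are made up for illustration, and the matching is assumed to be case-insensitive, which the real mapping may or may not provide.

```python
from collections import Counter

# The two sample documents from the question.
docs = [
    {"states": ["Washington (US-WA)", "California (US-CA)"]},
    {"states": ["Washington (US-WA)"]},
]


def matches(value, needle):
    # Assumption: substring match that ignores case.
    return needle.lower() in value.lower()


def terms_after_query(docs, needle):
    """Select whole documents first, then bucket ALL of their values."""
    counts = Counter()
    for doc in docs:
        if any(matches(value, needle) for value in doc["states"]):
            # Every value of a matching document lands in a bucket.
            counts.update(set(doc["states"]))
    return dict(counts)


def terms_with_value_filter(docs, needle):
    """Bucket only the individual values that match the filter."""
    counts = Counter()
    for doc in docs:
        counts.update({value for value in doc["states"] if matches(value, needle)})
    return dict(counts)
```

With these helpers, `terms_after_query(docs, "wa")` contains both Washington and California, while `terms_with_value_filter(docs, "wa")` contains only Washington, which is the result I am after.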

Is there a way to do so via a single ES 2.4 request?

igelbox · Accepted Answer · 2018-07-23 11:59:44Z

You could use the "Filtering Values" feature (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2). So, your request could look like:

POST /index/collection/_search?size=0
{
  "aggregations": {
    "typed": {
      "terms": {
        "field": "states.raw",
        "include": ".*wa.*" // Escape the user's "wa" input carefully: it is used as part of a regular expression
      },
      "aggregations": {
        "typed_hits": {
          "top_hits": {
            "_source": { "includes": ["states"] }
          }
        }
      }
    }
  }
}
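
The escaping mentioned in the comment above could be done on the application side before the request is sent. The following is a minimal Python sketch, not an official client API: the names `escape_lucene_regex`, `to_pattern` and `build_typeahead_request` are hypothetical, and the reserved-character list assumes that all optional Lucene regular-expression operators are enabled. The optional `ignore_case` expansion is an assumption as well; it is only useful if the raw terms keep their original casing and the pattern is matched case-sensitively.

```python
# Characters that act as operators in Lucene regular-expression syntax
# (assumption: the optional operators # @ & < > ~ are enabled too).
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape every reserved character in the user's input."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def to_pattern(user_input, ignore_case=False):
    """Build a 'contains' pattern; optionally expand ASCII letters to [xX]."""
    parts = []
    for ch in user_input:
        if ignore_case and ch.isascii() and ch.isalpha():
            parts.append("[" + ch.lower() + ch.upper() + "]")
        else:
            parts.append(escape_lucene_regex(ch))
    return ".*" + "".join(parts) + ".*"


def build_typeahead_request(user_input, field="states.raw", ignore_case=False):
    """Return a request body that only buckets terms matching the input."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggregations": {
            "typed": {
                "terms": {
                    "field": field,
                    "include": to_pattern(user_input, ignore_case),
                }
            }
        },
    }
```

For example, `build_typeahead_request("wa")` produces the same `include` value as the request above, and an input such as `"a.b"` is turned into `.*a\.b.*` so that the dot is matched literally.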

1 Comment

dazewell Over a year ago

I had been looking at include, but didn't use it properly and missed the chance to find the answer myself. Thanks!

Andrei Stefan · Answer · 2018-07-23 13:04:10Z

I can't hold back from pointing out, though, that a wildcard query with a leading wildcard is not the best solution. Please do consider using ngrams for this:

PUT states
{
  "settings": {
    "analysis": {
      "filter": {
        "ngrams": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "20"
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "location": {
          "properties": {
            "states": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                },
                "ngrams": {
                  "type": "string",
                  "analyzer": "ngram_analyzer"
                }
              }
            }
          }
        }
      }
    }
  }
}


POST states/doc/1
{
  "text":"bla1",
  "location": [
    {
      "states": [
        "Washington (US-WA)",
        "California (US-CA)"
      ]
    },
    {
      "states": [
        "Washington (US-WA)"
      ]
    }
  ]
}
POST states/doc/2
{
  "text":"bla2",
  "location": [
    {
      "states": [
        "Washington (US-WA)",
        "California (US-CA)"
      ]
    }
  ]
}
POST states/doc/3
{
  "text":"bla3",
  "location": [
    {
      "states": [
        "California (US-CA)"
      ]
    },
    {
      "states": [
        "Illinois (US-IL)"
      ]
    }
  ]
}

And the final query:

GET states/_search
{
  "query": {
    "term": {
      "location.states.ngrams": {
        "value": "sh"
      }
    }
  },
  "aggregations": {
    "filtering_states": {
      "terms": {
        "field": "location.states.raw",
        "include": ".*sh.*"
      },
      "aggs": {
        "typed_hits": {
          "top_hits": {
            "_source": {
              "includes": [
                "location.states"
              ]
            }
          }
        }
      }
    }
  }
}
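
To illustrate why the term query for "sh" finds the Washington documents, here is a rough Python approximation of what the ngram_analyzer above emits at index time. It is only a sketch: the function name `ngram_terms` is made up, and the `\w+` split is a simplified stand-in for the standard tokenizer, not its exact behavior.

```python
import re


def ngram_terms(text, min_gram=2, max_gram=20):
    """Approximate the indexed terms: tokenize, lowercase, emit all n-grams."""
    terms = set()
    # Simplified stand-in for the standard tokenizer plus the lowercase filter.
    for token in re.findall(r"\w+", text.lower()):
        for size in range(min_gram, max_gram + 1):
            # No iterations happen once size exceeds the token length.
            for start in range(len(token) - size + 1):
                terms.add(token[start:start + size])
    return terms
```

Under this approximation, "sh" is one of the terms produced for "Washington (US-WA)" but not for "California (US-CA)". Since a term query does not analyze its input, the typed text should be lowercased on the application side before it is sent.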

3 Comments

dazewell Over a year ago

Thanks for ngrams, but our requirements do need it to work this way and it's sad, I know. Can't mark your answer as the right one because there's the same query from @igelbox provided a bit earlier, that'd be unfair, sorry.

Andrei Stefan Over a year ago

Hehe, no worries about the answer. I've heard before about restrictions where something has to be used as is and cannot be changed. It's unfortunate, because those users will make the necessary changes only when they hit a performance issue, usually after the amount of data in the cluster or the number of requests increases. When this happens, the environment might already be seriously affected, and at that point a change in mapping will have a bigger impact on the cluster overall.

dazewell Over a year ago

That's totally fair, and you should know I did everything I could on my side to change that, but no luck. Maybe when we hit performance issues, we'll change it. And it'll be a good lesson for everyone :)
