Find distinct inner objects in Elasticsearch

Question

We're trying to find distinct inner objects in Elasticsearch. Here is a minimal example of our case. We're stuck with something like the following mapping (changing types or indices, or adding new fields, wouldn't be a problem, but the structure should remain as it is):

{
  "building": {
    "properties": {
      "street": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "house number": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "city": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "people": {
        "type": "object",
        "store": "yes",
        "index": "not_analyzed",
        "properties": {
          "firstName": {
            "type": "string",
            "store": "yes",
            "index": "not_analyzed"
          },
          "lastName": {
            "type": "string",
            "store": "yes",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

Assuming we have this example data:

{
  "buildings": [
    {
      "street": "Baker Street",
      "house number": "221 B",
      "city": "London",
      "people": [
        {
          "firstName": "John",
          "lastName": "Doe"
        },
        {
          "firstName": "Jane",
          "lastName": "Doe"
        }
      ]
    },
    {
      "street": "Baker Street",
      "house number": "5",
      "city": "London",
      "people": [
        {
          "firstName": "John",
          "lastName": "Doe"
        }
      ]
    },
    {
      "street": "Garden Street",
      "house number": "1",
      "city": "London",
      "people": [
        {
          "firstName": "Jane",
          "lastName": "Smith"
        }
      ]
    }
  ]
}

When we query for the street "Baker Street" (plus whatever additional options are needed), we expect to get the following list:

[
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Jane",
      "lastName": "Doe"
    }
]

The format does not matter too much, but we should be able to parse the first and last name. However, since our actual data set is much larger, we need the entries to be distinct.

We are using Elasticsearch 1.7.

people should be of type nested elastic.co/guide/en/elasticsearch/reference/current/…
– Julien C., commented Oct 26, 2015 at 15:26

katericata · Accepted Answer · 2017-11-13 00:20:43Z

We finally solved our problem.

Our solution is (as we expected) a pre-calculated people_all field. Instead of using copy_to or transform, we simply write it alongside the other fields when importing our data. The field looks as follows:

"people": {
  "type": "nested",
  ..
  "properties": {
    "firstName": {
      "type": "string",
      "store": "yes",
      "index": "not_analyzed"
    },
    "lastName": {
      "type": "string",
      "store": "yes",
      "index": "not_analyzed"
    },
    "people_all": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
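
A minimal sketch of the import-time step described above, assuming the documents are prepared in Python before indexing. The helper name add_people_all and the single-space separator are assumptions for illustration; the answer only states that people_all is written together with the other fields.

```python
def add_people_all(building):
    """Return a copy of a building document with people_all set on each person.

    Hypothetical helper: the separator (one space) is an assumption chosen so
    the values look like the bucket keys shown in the response ("John Doe").
    """
    people = []
    for person in building.get("people", []):
        enriched = dict(person)  # copy so the input document is not mutated
        enriched["people_all"] = f'{person["firstName"]} {person["lastName"]}'
        people.append(enriched)
    result = dict(building)
    result["people"] = people
    return result


building = {
    "street": "Baker Street",
    "house number": "221 B",
    "city": "London",
    "people": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Jane", "lastName": "Doe"},
    ],
}

doc = add_people_all(building)
for person in doc["people"]:
    print(person["people_all"])
```

The resulting document would then be indexed as usual, so each nested person carries its own people_all value.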

Please pay attention to the "index": "not_analyzed" setting on the people_all field. It is needed to get complete buckets. Without it the field is analyzed, and our example returns 3 buckets: "john", "jane" and "doe".
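
To illustrate why the setting matters, here is a rough simulation of how terms buckets are formed from analyzed and not_analyzed values. The function analyzed_tokens is only a simplified stand-in (lowercase, split on whitespace) and is an assumption; the real analyzer does more than this.

```python
from collections import Counter


def analyzed_tokens(value):
    # Simplified stand-in for an analyzer: lowercase, then split on whitespace.
    return value.lower().split()


def bucket_counts(values, analyzed):
    """Count, per term, how many values (nested documents) contain it."""
    counts = Counter()
    for value in values:
        terms = set(analyzed_tokens(value)) if analyzed else {value}
        counts.update(terms)
    return dict(counts)


# people_all values of the three nested people in the two Baker Street buildings
values = ["John Doe", "Jane Doe", "John Doe"]

print(bucket_counts(values, analyzed=True))
print(bucket_counts(values, analyzed=False))
```

With analysis, the terms are the individual lowercase words, so first and last names end up in separate buckets. Without it, each whole value is a single term.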

After writing this new field, we can run an aggregation as follows:

{
  "size": 0,
  "query": {
    "term": {
      "street": "Baker Street"
    }
  },
  "aggs": {
    "people_distinct": {
      "nested": {
        "path": "people"
      },
      "aggs": {
        "people_name_distinct": {
          "terms": {
            "field": "people.people_all",
            "size": 0
          }
        }
      }
    }
  }
}

This returns the following response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "people_distinct": {
      "doc_count": 3,
      "people_name_distinct": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "John Doe",
            "doc_count": 2
          },
          {
            "key": "Jane Doe",
            "doc_count": 1
          }
        ]
      }
    }
  }
}

From the buckets in the response we can now build the distinct people objects.
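
A minimal sketch of that parsing step, assuming the response has been loaded into a Python dict and that people_all was written as "firstName lastName" with a single space. Splitting on the first space is an assumption and breaks for first names that contain spaces; a separator that cannot occur in names would be safer. The function name distinct_people is hypothetical.

```python
def distinct_people(response):
    """Turn the terms buckets of the response above into people objects."""
    buckets = (
        response["aggregations"]["people_distinct"]["people_name_distinct"]["buckets"]
    )
    people = []
    for bucket in buckets:
        # Assumption: the key is "<firstName> <lastName>", split at the first space.
        first_name, _, last_name = bucket["key"].partition(" ")
        people.append({"firstName": first_name, "lastName": last_name})
    return people


# Trimmed copy of the response shown above (only the aggregations part).
response = {
    "aggregations": {
        "people_distinct": {
            "doc_count": 3,
            "people_name_distinct": {
                "buckets": [
                    {"key": "John Doe", "doc_count": 2},
                    {"key": "Jane Doe", "doc_count": 1},
                ]
            },
        }
    }
}

print(distinct_people(response))
```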

Please let us know if there is a better way to reach our goal. Parsing the buckets is not an optimal solution; it would be nicer to have the fields firstName and lastName in each bucket.

ChintanShah25 · Accepted Answer · 2015-10-26 17:24:54Z

As suggested in the comment, your people field should be mapped as nested rather than object, because object can give unexpected results. You also need to reindex your data after changing the mapping.
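
One way to picture the unexpected results: arrays of inner objects mapped as object are flattened into one list of values per field, so the link between a person's firstName and lastName is lost. The following is a simplified simulation of that behaviour, not Elasticsearch code. The function names and the sample building (John Doe and Jane Smith living together) are hypothetical.

```python
def flatten_object_field(doc, field):
    """Simulate how an array of inner objects is flattened under type object."""
    flat = {}
    for inner in doc.get(field, []):
        for key, value in inner.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_as_object(doc, first_name, last_name):
    # Both conditions are checked against the flattened value lists.
    flat = flatten_object_field(doc, "people")
    return (
        first_name in flat.get("people.firstName", [])
        and last_name in flat.get("people.lastName", [])
    )


def matches_as_nested(doc, first_name, last_name):
    # Both conditions must hold for the same inner object.
    return any(
        p["firstName"] == first_name and p["lastName"] == last_name
        for p in doc.get("people", [])
    )


# Hypothetical building, not part of the example data in the question.
building = {
    "people": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Jane", "lastName": "Smith"},
    ]
}

print(flatten_object_field(building, "people"))
print(matches_as_object(building, "Jane", "Doe"))
print(matches_as_nested(building, "Jane", "Doe"))
```

In this sketch a search for "Jane Doe" matches the flattened document even though no such person lives there, while the per-object check does not.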

As for the question, you need to aggregate the results of your query:

{
  "query": {
    "term": {
      "street": "Baker Street"
    }
  },
  "aggs": {
    "distinct_people": {
      "terms": {
        "field": "people",
        "size": 1000
      }
    }
  }
}

Please note that I have set size to 1000 inside the aggregation. You might have to use a bigger number to get all distinct people, because ES returns only 10 buckets by default.

You could set the query size to 0 or use the parameter search_type=count if you are interested only in the aggregated buckets. You can read more about aggregations here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

I hope this helps! Let me know if this does not work out.


2 Comments

soniro Over a year ago

Thank you for your quick answer! Unfortunately this doesn't work (we tried with "object" and "nested"). We already tried to copy the people fields into a "people_all" field and use this new field in terms, but this didn't lead to the expected result either.

{
  "aggs": {
    "people_distinct": {
      "nested": {
        "path": "people"
      },
      "aggs": {
        "people_name_distinct": {
          "terms": {
            "field": "people.people_all"
          }
        }
      }
    }
  }
}

in this case leads to 3 buckets: jane, john, doe

soniro Over a year ago

After searching for further information, we stumbled upon the not_analyzed setting, which should solve our problem. If people_all were not_analyzed, the buckets should be "Jane Doe" and "John Doe". But this doesn't seem to work. Is it because it's a nested field? I'll let you know as soon as I find more information. Assumption: this does not work because people_all is an array of strings.
