We're trying to find distinct inner objects in Elasticsearch. This is a minimal example of our case. We're stuck with something like the following mapping (changing types or indices, or adding new fields, wouldn't be a problem, but the structure should remain as it is):

{
  "building": {
    "properties": {
      "street": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "house number": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "city": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "people": {
        "type": "object",
        "store": "yes",
        "index": "not_analyzed",
        "properties": {
          "firstName": {
            "type": "string",
            "store": "yes",
            "index": "not_analyzed"
          },
          "lastName": {
            "type": "string",
            "store": "yes",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}

Assuming we have this example data:

{
  "buildings": [
    {
      "street": "Baker Street",
      "house number": "221 B",
      "city": "London",
      "people": [
        {
          "firstName": "John",
          "lastName": "Doe"
        },
        {
          "firstName": "Jane",
          "lastName": "Doe"
        }
      ]
    },
    {
      "street": "Baker Street",
      "house number": "5",
      "city": "London",
      "people": [
        {
          "firstName": "John",
          "lastName": "Doe"
        }
      ]
    },
    {
      "street": "Garden Street",
      "house number": "1",
      "city": "London",
      "people": [
        {
          "firstName": "Jane",
          "lastName": "Smith"
        }
      ]
    }
  ]
}

When we query for the street "Baker Street" (with whatever additional options are needed), we expect to get the following list:

[
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Jane",
      "lastName": "Doe"
    }
]

The format does not matter too much, but we should be able to parse the first and last name. However, since our actual data set is much larger, we need the entries to be distinct.
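To make the expected result concrete, here is a minimal client-side sketch in Python of what "distinct" means for the sample data above. It is only an illustration of the desired output, not an Elasticsearch query; the function name `distinct_people` is hypothetical.

```python
def distinct_people(buildings, street):
    """Return the unique people living on `street`, in first-seen order."""
    seen = set()
    result = []
    for building in buildings:
        if building.get("street") != street:
            continue
        for person in building.get("people", []):
            # Two people count as the same if both names match exactly.
            key = (person["firstName"], person["lastName"])
            if key not in seen:
                seen.add(key)
                result.append({"firstName": key[0], "lastName": key[1]})
    return result


# The sample data from the question.
buildings = [
    {
        "street": "Baker Street",
        "house number": "221 B",
        "city": "London",
        "people": [
            {"firstName": "John", "lastName": "Doe"},
            {"firstName": "Jane", "lastName": "Doe"},
        ],
    },
    {
        "street": "Baker Street",
        "house number": "5",
        "city": "London",
        "people": [{"firstName": "John", "lastName": "Doe"}],
    },
    {
        "street": "Garden Street",
        "house number": "1",
        "city": "London",
        "people": [{"firstName": "Jane", "lastName": "Smith"}],
    },
]

print(distinct_people(buildings, "Baker Street"))
```

John Doe appears in two buildings on Baker Street but should be listed only once.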

We are using Elasticsearch 1.7.

2 Answers

We finally solved our problem.

Our solution is, as we expected, a pre-calculated people_all field. Instead of using copy_to or transform, we simply write it alongside the other fields when importing our data. The field looks as follows:

"people": {
  "type": "nested",
  ..
  "properties": {
    "firstName": {
      "type": "string",
      "store": "yes",
      "index": "not_analyzed"
    },
    "lastName": {
      "type": "string",
      "store": "yes",
      "index": "not_analyzed"
    },
    "people_all": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

Note the "index": "not_analyzed" setting on the people_all field. It is required to get complete buckets: without it, the field is analyzed and our example returns 3 buckets, "john", "jane" and "doe".
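Writing the field at import time could look roughly like the following Python sketch. The helper name `with_people_all` is hypothetical, and the "firstName lastName" format with a single space is an assumption inferred from the bucket keys shown in the response further down.

```python
def with_people_all(building):
    """Return a copy of `building` whose people each carry a people_all value."""
    enriched = dict(building)
    enriched["people"] = [
        # Assumed format: first and last name joined by one space.
        dict(person, people_all=f'{person["firstName"]} {person["lastName"]}')
        for person in building.get("people", [])
    ]
    return enriched


doc = with_people_all(
    {
        "street": "Baker Street",
        "house number": "221 B",
        "city": "London",
        "people": [
            {"firstName": "John", "lastName": "Doe"},
            {"firstName": "Jane", "lastName": "Doe"},
        ],
    }
)
print(doc["people"])
```

The enriched document would then be indexed in place of the original one, so no copy_to or transform is involved.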

After writing this new field, we can run an aggregation as follows:

{
  "size": 0,
  "query": {
    "term": {
      "street": "Baker Street"
    }
  },
  "aggs": {
    "people_distinct": {
      "nested": {
        "path": "people"
      },
      "aggs": {
        "people_name_distinct": {
          "terms": {
            "field": "people.people_all",
            "size": 0
          }
        }
      }
    }
  }
}

This returns the following response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "people_distinct": {
      "doc_count": 3,
      "people_name_distinct": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "John Doe",
            "doc_count": 2
          },
          {
            "key": "Jane Doe",
            "doc_count": 1
          }
        ]
      }
    }
  }
}

From the buckets in the response, we are now able to create the distinct people objects.

Please let us know if there is a better way to reach our goal. Parsing the bucket keys is not an optimal solution; it would be nicer to have the firstName and lastName fields in each bucket.
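For illustration, turning the buckets back into people objects might look like the Python sketch below. The function name `people_from_buckets` is hypothetical. The sketch assumes the aggregation names from the response above and that each key is "firstName lastName" with the last name containing no spaces, which is exactly why this parsing step is fragile.

```python
def people_from_buckets(response):
    """Rebuild people objects from the terms buckets of the nested aggregation."""
    buckets = (
        response["aggregations"]["people_distinct"]["people_name_distinct"]["buckets"]
    )
    people = []
    for bucket in buckets:
        # Split on the last space: everything before it is the first name.
        first, _, last = bucket["key"].rpartition(" ")
        people.append({"firstName": first, "lastName": last})
    return people


# Trimmed-down version of the response shown above.
response = {
    "aggregations": {
        "people_distinct": {
            "doc_count": 3,
            "people_name_distinct": {
                "buckets": [
                    {"key": "John Doe", "doc_count": 2},
                    {"key": "Jane Doe", "doc_count": 1},
                ]
            },
        }
    }
}
print(people_from_buckets(response))
```

A separator that cannot occur in a name would make the split unambiguous, at the cost of less readable bucket keys.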


As suggested in the comment, your mapping of people should be of type nested rather than object, as the object type can give unexpected results. You also need to reindex your data after that change.

As for the question, you need to aggregate results based on your query.

{
  "query": {
    "term": {
      "street": "Baker Street"
    }
  },
  "aggs": {
    "distinct_people": {
      "terms": {
        "field": "people",
        "size": 1000
      }
    }
  }
}

Please note that I have set size to 1000 inside the aggregation. You might have to use a bigger number to get all distinct people, since ES returns only 10 results by default.

You could set the query size to 0 or use the parameter search_type=count if you are interested only in the aggregated buckets. You can read more about aggregations here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html

I hope this helps! Let me know if this does not work out.

2 Comments

Thank you for your quick answer! Unfortunately, this doesn't work (we tried with both "object" and "nested"). We had already tried copying the people fields into a "people_all" field and using this new field in terms, but this didn't lead to the expected result either. { "aggs" : { "people_distinct" : { "nested" : { "path" : "people" }, "aggs" : { "people_name_distinct" : { "terms" : { "field" : "people.people_all" } } } } } } in this case leads to 3 buckets: jane, john, doe
After searching for further information, we stumbled upon the not_analyzed setting, which should solve our problem. If people_all were not_analyzed, the buckets should be "Jane Doe" and "John Doe". But this doesn't seem to work. Is it because it's a nested field? I'll let you know as soon as I find more information. Assumption: this does not work because people_all is an array of strings.
