For instance, given this result for a search, reduced to a size of 3 hits for brevity:

{
  "hits": {
    "total": {
      "value": 51812937,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "desc-imunizacao",
        "_type": "_doc",
        "_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0",
        "_score": 1.0,
        "_source": {
          "vacina_descricao_dose": "    2ª Dose",
          "estabelecimento_uf": "BA",
          "document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
        }
      },
      {
        "_index": "desc-imunizacao",
        "_type": "_doc",
        "_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0",
        "_score": 1.0,
        "_source": {
          "vacina_descricao_dose": "    1ª Dose",
          "estabelecimento_uf": "SE",
          "document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
        }
      },
      {
        "_index": "desc-imunizacao",
        "_type": "_doc",
        "_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0",
        "_score": 1.0,
        "_source": {
          "vacina_descricao_dose": "    1ª Dose",
          "estabelecimento_uf": "SE",
          "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
        }
      }
    ]
  }
}

If I wanted to query for "estabelecimento_uf": "SE" and keep only one result for duplicates of "document_id", I would issue:

{
  "_source": ["document_id", "estabelecimento_uf", "vacina_descricao_dose"],
  "query": {
    "match": {
      "estabelecimento_uf": {
        "query": "SE"
      }
    }
  },
  "collapse": {
    "field": "document_id",
    "inner_hits": {
      "name": "latest",
      "size": 1
    }
  }
}
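To make the intent of the request above concrete, here is a minimal Python sketch that mimics it locally on the three sample hits: filter on estabelecimento_uf, then keep one hit per document_id. It only illustrates the expected outcome and says nothing about how Elasticsearch implements collapsing; the helper name `collapse_hits` is hypothetical.

```python
# _source documents copied from the sample search result above.
hits = [
    {"vacina_descricao_dose": "    2ª Dose", "estabelecimento_uf": "BA",
     "document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"},
    {"vacina_descricao_dose": "    1ª Dose", "estabelecimento_uf": "SE",
     "document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"},
    {"vacina_descricao_dose": "    1ª Dose", "estabelecimento_uf": "SE",
     "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
]


def collapse_hits(docs, match_field, match_value, collapse_field):
    """Hypothetical helper: keep the first matching doc per collapse_field value."""
    kept = {}
    for doc in docs:
        if doc.get(match_field) != match_value:
            continue  # not part of the query result
        # setdefault keeps the first doc seen for each key and ignores repeats
        kept.setdefault(doc[collapse_field], doc)
    return list(kept.values())


collapsed = collapse_hits(hits, "estabelecimento_uf", "SE", "document_id")
print(len(collapsed))
```

The length of `collapsed` is the number I am after: one entry per distinct document_id among the "SE" documents.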

Is there a way to achieve this with Elasticsearch's Count API? That is, filter on field A (estabelecimento_uf) and count the unique values of field B (document_id), given that document_id contains duplicates.


This is a public API: https://imunizacao-es.saude.gov.br/_search

This is the authentication:

User: imunizacao_public Pass: qlto5t&7r_@+#Tlstigi

  • Maybe use aggs? Commented May 14, 2021 at 2:48
  • Yes, maybe something like Cardinality aggregation, I'll try it. I was expecting to use Count API for this... Commented May 14, 2021 at 4:16

1 Answer

You can combine a filter aggregation with a cardinality aggregation to count the unique document_id values among the documents that match a filter:

{
  "size": 0,
  "aggs": {
    "filter_agg": {
      "filter": {
        "term": {
          "estabelecimento_uf.keyword": "SE"
        }
      },
      "aggs": {
        "count_docid": {
          "cardinality": {
            "field": "document_id.keyword"
          }
        }
      }
    }
  }
}

As far as I know, you cannot get the count of distinct field values with the Count API. You can either use the field collapsing feature (as done in the question) or a cardinality aggregation.
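If you need to build this request for different fields, a small Python sketch like the one below can generate the body. The function name `build_distinct_count_body` is hypothetical, and whether the `.keyword` sub-fields exist depends on your index mapping. Keep in mind that the cardinality aggregation returns an approximate count; its `precision_threshold` option trades memory for accuracy, with 40000 as the maximum supported value.

```python
import json


def build_distinct_count_body(filter_field, filter_value, distinct_field,
                              precision_threshold=None):
    """Build a _search body: a filter aggregation wrapping a cardinality aggregation."""
    cardinality = {"field": distinct_field}
    if precision_threshold is not None:
        # Optional accuracy/memory trade-off for the approximate count
        cardinality["precision_threshold"] = precision_threshold
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "filter_agg": {
                "filter": {"term": {filter_field: filter_value}},
                "aggs": {"count_docid": {"cardinality": cardinality}},
            }
        },
    }


# Field names assume the default text + keyword multi-field mapping.
body = build_distinct_count_body(
    "estabelecimento_uf.keyword", "SE", "document_id.keyword",
    precision_threshold=40000,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request; the distinct count is then read from `aggregations.filter_agg.count_docid.value`.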


Adding a working example with index data, search queries, and search results.

Index Data:

{
  "vacina_descricao_dose": "    2ª Dose",
  "estabelecimento_uf": "BA",
  "document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
{
  "vacina_descricao_dose": "    1ª Dose",
  "estabelecimento_uf": "SE",
  "document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
{
  "vacina_descricao_dose": "    1ª Dose",
  "estabelecimento_uf": "SE",
  "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
{
  "vacina_descricao_dose": "    1ª Dose",
  "estabelecimento_uf": "SE",
  "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}

Search Query 1:

{
  "size": 0,
  "query": {
    "match": {
      "estabelecimento_uf": "SE"
    }
  },
  "aggs": {
    "count_doc_id": {
      "cardinality": {
        "field": "document_id.keyword"
      }
    }
  }
}

Search Result:

"aggregations": {
    "count_doc_id": {
      "value": 2            // note this
    }
  }

Search Query 2:

{
  "size": 0,
  "aggs": {
    "filter_agg": {
      "filter": {
        "term": {
          "estabelecimento_uf.keyword": "SE"
        }
      },
      "aggs": {
        "count_docid": {
          "cardinality": {
            "field": "document_id.keyword"
          }
        }
      }
    }
  }
}

Search Result:

"aggregations": {
    "filter_agg": {
      "doc_count": 3,
      "count_docid": {
        "value": 2         // note this
      }
    }
  }
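
As a sanity check, the same numbers can be reproduced outside Elasticsearch. The following Python sketch recomputes them from the four indexed documents above with an exact set-based count; it is only an illustration, since the cardinality aggregation itself is approximate on large data sets.

```python
# The four documents from the index data above.
docs = [
    {"vacina_descricao_dose": "    2ª Dose", "estabelecimento_uf": "BA",
     "document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"},
    {"vacina_descricao_dose": "    1ª Dose", "estabelecimento_uf": "SE",
     "document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"},
    {"vacina_descricao_dose": "    1ª Dose", "estabelecimento_uf": "SE",
     "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
    {"vacina_descricao_dose": "    1ª Dose", "estabelecimento_uf": "SE",
     "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
]

# Equivalent of the filter aggregation: documents with estabelecimento_uf == "SE".
se_docs = [d for d in docs if d["estabelecimento_uf"] == "SE"]
doc_count = len(se_docs)

# Equivalent of the cardinality aggregation: distinct document_id values.
distinct_ids = len({d["document_id"] for d in se_docs})

print(doc_count, distinct_ids)  # prints: 3 2
```

Three documents match "SE", but two of them share a document_id, which is why doc_count is 3 while the cardinality value is 2.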

8 Comments

This is what I ended up doing, minus the first-level aggs, using a "query match" for estabelecimento_id instead. I figured out it wasn't possible with the Count API. I forgot to include my answer – thanks.
I compared my query's result with yours, and when doing a query with (this minified version): {"size":0,"query":{"match":{"estabelecimento_uf":"SE"}},"aggs":{"count_doc_id":{"cardinality":{"field":"document_id"}}}}, I get document count of value 612571, against 613904 from your query. Care to help me understand the difference? Useful: JSON Beautify tool. Note: there's no .keyword here.
Actually, I think this isn't one of the answers (the answer you have provided). If I remove the cardinality aggregation, it returns the same result. And I'm certain that there are duplicates for document_id for a number of states (estabelecimento_uf), so I think my query works as expected. I'll post it shortly, and hopefully someone will point out the differences or mistakes.
@RickStanley if you look at this part of documentation --> elastic.co/guide/en/elasticsearch/reference/current/…, you will find that the query will work in the same way as your minified version
I think it's safe and fair to say that this is the right answer, regardless of my data. But this .keyword property left me wondering about its usage. I'm no Elasticsearch expert - just getting started.