3

I need to delete a large number of documents from an Elasticsearch 5.5 cluster. I know the ideal way to do this is to rebuild the cluster without the unwanted documents, but that's not possible in our case. I run the following query, which deletes documents from a subset of the indices in the cluster:

GET myindex_1*/doc_type/_delete_by_query
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "typeCode": [
              "Filtered_Type"
            ]
          }
        }
      ],
      "must": [
        {
          "range": {
            "createdDateUTC": {
              "lt": "2017-10-28"
            }
          }
        }
      ]
    }
  }
}

It starts deleting documents for a couple of hours but then just stops and I have to kick it off again. Any ideas why it stops running the delete query?

Just a note: I'm using Kibana to run the query, and the request times out on the client side even though I can see it continues deleting on the backend.

1
  • Isn't it because of a timeout? Could you try POST instead of GET? Commented Oct 31, 2019 at 17:54

2 Answers

1

From the Elasticsearch documentation:

By default _delete_by_query uses scroll batches of 1000. You can change the batch size with the scroll_size URL parameter:

POST twitter/_delete_by_query?scroll_size=5000
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}

You can find more information about batching and batch sizes in the Elasticsearch documentation.

Since the delete has to scroll through one or more batches to remove all of the documents found by your query, the documentation on scrolling is worth reading as well.


2 Comments

Would a default batch size of 1000 cause it to stop deleting documents eventually? And would increasing it to 5000 prevent this from happening?
I updated my answer. Simply increasing the batch size to 5000 isn't sufficient. You need to think in terms of scrolling through all the batches and deleting all of the documents in each batch.
0

The Delete by Query API can halt if it runs into conflicting versions of a document. This can happen if a document was updated after the delete by query started but before it reached the document (Elastic documentation).

If you're running the deletion asynchronously, you can fetch the task details after it completes to see if there were any failures (Task API docs).
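To make the asynchronous route concrete, here is a rough Python sketch using only the standard library. It assumes the cluster is reachable at http://localhost:9200, and the helper names (start_url, task_url, call, summarize_task) are made up for illustration. The wait_for_completion=false parameter and GET _tasks/<task_id> are the documented pieces; the fields read in summarize_task are the ones I'd expect in a finished task's response, so treat that part as an assumption to verify against your version.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: change to your cluster address


def start_url(index_pattern, doc_type):
    # wait_for_completion=false makes Elasticsearch return a task id right away
    # instead of holding the HTTP request open until the delete finishes.
    return f"{ES_URL}/{index_pattern}/{doc_type}/_delete_by_query?wait_for_completion=false"


def task_url(task_id):
    # The Task API reports progress and, once finished, the stored result.
    return f"{ES_URL}/_tasks/{task_id}"


def call(method, url, body=None):
    # Small HTTP helper (hypothetical); sends an optional JSON body.
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_task(task_doc):
    # Pull out the fields worth checking from a GET _tasks/<id> reply.
    result = task_doc.get("response", {})
    return {
        "completed": task_doc.get("completed", False),
        "deleted": result.get("deleted"),
        "version_conflicts": result.get("version_conflicts"),
        "failures": result.get("failures", []),
    }


# Usage against a live cluster (not executed here):
#   task_id = call("POST", start_url("myindex_1*", "doc_type"), {"query": {...}})["task"]
#   print(summarize_task(call("GET", task_url(task_id))))
```

Because the request returns immediately, the Kibana client timeout no longer matters, and a non-empty failures list or a non-zero version_conflicts count in the summary points at why the run stopped.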

You can also specify the conflicts=proceed query parameter, which stops the deletion from halting when a conflict is detected. I'm not sure whether the conflicting document will still be deleted, though.
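As a sketch of what that looks like with the query from the question, the Python below only builds the request (method, path, and JSON body) without sending it. The build_request helper is a made-up name for illustration; conflicts=proceed is the real parameter, and POST is used because that is the method the Delete By Query documentation shows.

```python
import json
from urllib.parse import urlencode

# The query from the question, unchanged.
QUERY = {
    "query": {
        "bool": {
            "filter": [{"terms": {"typeCode": ["Filtered_Type"]}}],
            "must": [{"range": {"createdDateUTC": {"lt": "2017-10-28"}}}],
        }
    }
}


def build_request(index_pattern, doc_type, **params):
    # Returns (method, path, body) for a delete-by-query call.
    path = f"/{index_pattern}/{doc_type}/_delete_by_query"
    if params:
        path += "?" + urlencode(params)
    return "POST", path, json.dumps(QUERY)


method, path, body = build_request("myindex_1*", "doc_type", conflicts="proceed")
print(method, path)
```

The response (or the finished task) includes a version_conflicts count, so running the same query again afterwards is one way to pick up anything that was skipped the first time.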
