There's no solution that I know of that allows you to do this in one shot. However, there's a way to do it in two steps, without having to iterate over several batches of hashes.
The idea is to first identify all the hashes to be updated using a feature called Transforms, which simply runs aggregations over a source index and builds a new index out of the aggregation results.
Once that new index has been created by your transform, you can use it as a terms lookup mechanism to run your update by query and update the isDupe boolean for all documents having a matching hash.
So, first, we want to create a transform that will create a new index featuring documents containing all duplicate hashes that need to be updated. This is achieved using a scripted_metric aggregation whose job is to identify all hashes occurring at least twice among the documents where isDupe is false. We're also aggregating by week, so for each week, there's going to be a document containing all duplicate hashes for that week.
PUT _transform/dup-transform
{
  "source": {
    "index": "test-index",
    "query": {
      "term": {
        "isDupe": "false"
      }
    }
  },
  "dest": {
    "index": "test-dups",
    "pipeline": "set-id"
  },
  "pivot": {
    "group_by": {
      "week": {
        "date_histogram": {
          "field": "lastModifiedDate",
          "calendar_interval": "week"
        }
      }
    },
    "aggregations": {
      "dups": {
        "scripted_metric": {
          "init_script": """
            state.week = -1;
            state.hashes = [:];
          """,
          "map_script": """
            // gather all hashes from each shard and count them
            def hash = doc['identificationHash.keyword'].value;
            // set week
            state.week = doc['lastModifiedDate'].value.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR).toString();
            // initialize hashes
            if (!state.hashes.containsKey(hash)) {
              state.hashes[hash] = 0;
            }
            // increment hash
            state.hashes[hash] += 1;
          """,
          "combine_script": "return state",
          "reduce_script": """
            def hashes = [:];
            def week = -1;
            // group the hash counts from each shard and add them up
            for (state in states) {
              if (state == null) return null;
              week = state.week;
              for (hash in state.hashes.keySet()) {
                if (!hashes.containsKey(hash)) {
                  hashes[hash] = 0;
                }
                hashes[hash] += state.hashes[hash];
              }
            }
            // only return the hashes occurring at least twice
            return [
              'week': week,
              'hashes': hashes.keySet().stream().filter(hash -> hashes[hash] >= 2)
                                .collect(Collectors.toList())
            ]
          """
        }
      }
    }
  }
}
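To make the map/reduce logic of the scripted_metric easier to follow, here is a minimal Python sketch that mimics it outside Elasticsearch. It is only an illustration, not what Elasticsearch executes: the function names, the sample shards and the `min_count` parameter are assumptions made up for this example, and the result is sorted just to make the output deterministic (the Painless script does not sort).

```python
from collections import Counter

def map_shard(docs):
    """Mimic init_script + map_script on one shard: count each hash."""
    counts = Counter()
    for doc in docs:
        counts[doc["identificationHash"]] += 1
    return counts

def reduce_shards(states, min_count=2):
    """Mimic reduce_script: sum the per-shard counts, keep hashes seen at least min_count times."""
    total = Counter()
    for state in states:
        total.update(state)  # Counter.update adds counts together
    # sorted() is only here for a stable output; it is not part of the transform
    return sorted(h for h, n in total.items() if n >= min_count)

# Hypothetical sample data: two shards holding documents from the same week.
shard_1 = [{"identificationHash": "12345"}, {"identificationHash": "abcde"}]
shard_2 = [{"identificationHash": "12345"}, {"identificationHash": "fffff"}]

duplicates = reduce_shards([map_shard(shard_1), map_shard(shard_2)])
print(duplicates)  # ['12345']
```

Note that the hash "12345" occurs only once on each shard, which is why the counts have to be summed in the reduce phase before filtering.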
Before running the transform, we need to create the set-id pipeline (referenced in the dest section of the transform). It sets the ID of each target document to its week number, so that we can reference that document in the terms query when updating documents:
PUT _ingest/pipeline/set-id
{
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{dups.week}}"
      }
    }
  ]
}
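If the `{{dups.week}}` template looks opaque, the following Python sketch shows roughly what it resolves to. It is a simplified stand-in for the template rendering, not the Elasticsearch implementation: `resolve_path` is a hypothetical helper, and the sample document simply mirrors the shape of what the transform writes.

```python
def resolve_path(source, dotted_path):
    """Walk a nested dict following a dotted path such as 'dups.week'."""
    value = source
    for key in dotted_path.split("."):
        value = value[key]
    return value

# Shape of a document produced by the transform (sample values)
transform_output = {
    "week": "2021-11-01T00:00:00.000Z",
    "dups": {"week": "44", "hashes": ["12345"]},
}

# The set processor writes the rendered value into _id
doc_id = str(resolve_path(transform_output, "dups.week"))
print(doc_id)  # 44
```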
We're now ready to start the transform to generate the list of hashes to update and it's as simple as running this:
POST _transform/dup-transform/_start
When it has run, the destination index test-dups will contain one document per week, each looking like this:
{
  "_index" : "test-dups",
  "_type" : "_doc",
  "_id" : "44",
  "_score" : 1.0,
  "_source" : {
    "week" : "2021-11-01T00:00:00.000Z",
    "dups" : {
      "week" : "44",
      "hashes" : [
        "12345"
      ]
    }
  }
}
Finally, we can run the update by query on the source index as follows (add one terms query per weekly document in test-dups):
POST test-index/_update_by_query
{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "terms": {
            "identificationHash": {
              "index": "test-dups",
              "id": "44",
              "path": "dups.hashes"
            }
          }
        },
        {
          "terms": {
            "identificationHash": {
              "index": "test-dups",
              "id": "45",
              "path": "dups.hashes"
            }
          }
        }
      ]
    }
  },
  "script": {
    "source": "ctx._source.isDupe = true;"
  }
}
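With many weekly documents, writing one terms clause per week by hand gets tedious. Here is a minimal Python sketch that builds the same request body from a list of week IDs. The function name `build_update_body` and its parameters are assumptions for this example; it only produces the JSON and does not talk to Elasticsearch, so you would still send the result with whatever client you use.

```python
import json

def build_update_body(week_ids, lookup_index="test-dups",
                      field="identificationHash", path="dups.hashes"):
    """Build an update-by-query body with one terms lookup per weekly document."""
    should = [
        {"terms": {field: {"index": lookup_index, "id": week_id, "path": path}}}
        for week_id in week_ids
    ]
    return {
        "query": {"bool": {"minimum_should_match": 1, "should": should}},
        "script": {"source": "ctx._source.isDupe = true;"},
    }

# Same two weekly documents as in the request above
body = build_update_body(["44", "45"])
print(json.dumps(body, indent=2))
```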
That's it in two simple steps!! Try it out and let me know.