Avoid duplicate documents in Elasticsearch

Question

I parse documents from a JSON source and add them as children of a parent document. I simply post the items to the index without specifying an ID.

The JSON is sometimes updated and items are added to it. For example, I parse 2 documents from the JSON, and a week or two later I parse the same JSON again. This time it contains 3 documents.

I found answers like "remove all children and insert all items again", but I doubt this is the solution I'm looking for.

I could compare each item to the children of my target parent and add only the items that have no equal child.

Is there a way to let Elasticsearch handle duplicates?

If the IDs are different every time, then it's not possible. Elasticsearch doesn't handle duplicates.
– Andrei Stefan, Commented Nov 6, 2015 at 20:36
You could use either a primary key from the database or a hashing mechanism to generate a unique ID for a given document. If you post documents without specifying _id, ES will generate a unique _id for each document, regardless of its content.
– ChintanShah25, Commented Nov 6, 2015 at 21:53

Vineeth Mohan · Accepted Answer · 2015-11-07 02:29:21Z

7

Duplicates need to be handled through the ID itself. Choose a key that is unique for a document and use it as the _id. If the key is too large or consists of multiple keys, create a SHA checksum from it and use that as the _id.
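A minimal sketch of the checksum idea, assuming the items have some field (here a hypothetical `sku`) that identifies them. The function names, the index name `items`, and the sample data are all made up for illustration. The action dicts follow the `_index` / `_id` / `_source` shape that bulk helpers in the Python client accept, but nothing here talks to a cluster.

```python
import hashlib
import json


def make_doc_id(doc, key_fields):
    """Return a deterministic SHA-1 hex digest built from the chosen key fields."""
    # Keep only the identifying fields; a missing field becomes None.
    key = {field: doc.get(field) for field in key_fields}
    # sort_keys and fixed separators keep the serialisation stable across runs.
    canonical = json.dumps(key, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def to_bulk_actions(docs, index_name, key_fields):
    """Yield one action per document, each carrying the derived _id."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": make_doc_id(doc, key_fields),
            "_source": doc,
        }


# Hypothetical data: the second parse contains one extra item.
first_run = [{"sku": "a-1", "name": "Widget"}, {"sku": "b-2", "name": "Gadget"}]
second_run = first_run + [{"sku": "c-3", "name": "Gizmo"}]

ids_first = {a["_id"] for a in to_bulk_actions(first_run, "items", ["sku"])}
ids_second = {a["_id"] for a in to_bulk_actions(second_run, "items", ["sku"])}
print(len(ids_second - ids_first))  # number of ids not seen in the first run
```

Because the same item always hashes to the same _id, re-posting the whole JSON overwrites the existing children instead of adding copies. If the same item can appear under several parents, the parent's ID probably belongs in `key_fields` as well.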

If you already have duplicates in the index, you can use a terms aggregation with a nested top_hits aggregation to detect them.

You can read more about this approach here.
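As a rough sketch, the terms plus top_hits request could be built like this. The aggregation names `duplicate_keys` and `duplicate_docs`, the helper functions, and the sample response are invented for the example; the response dict is hand-written to mirror the usual aggregation response shape and is not real cluster output. The grouping field has to be one that supports terms aggregations (for example a non-analyzed or keyword-type field).

```python
def duplicates_query(field, max_buckets=100, docs_per_bucket=10):
    """Build a search body that groups documents by `field` and keeps only
    groups containing at least two documents."""
    return {
        "size": 0,  # only the aggregation result is needed, not the search hits
        "aggs": {
            "duplicate_keys": {
                "terms": {"field": field, "min_doc_count": 2, "size": max_buckets},
                "aggs": {
                    # top_hits exposes the individual documents in each bucket
                    "duplicate_docs": {"top_hits": {"size": docs_per_bucket}}
                },
            }
        },
    }


def extra_ids(response):
    """Return the ids of all but the first hit in each duplicate bucket."""
    ids = []
    for bucket in response["aggregations"]["duplicate_keys"]["buckets"]:
        hits = bucket["duplicate_docs"]["hits"]["hits"]
        ids.extend(hit["_id"] for hit in hits[1:])
    return ids


# Hand-written stand-in for a search response, for illustration only.
sample_response = {
    "aggregations": {
        "duplicate_keys": {
            "buckets": [
                {
                    "key": "a-1",
                    "doc_count": 2,
                    "duplicate_docs": {"hits": {"hits": [{"_id": "x1"}, {"_id": "x2"}]}},
                }
            ]
        }
    }
}

body = duplicates_query("sku")
print(extra_ids(sample_response))
```

The ids returned by `extra_ids` are candidates for deletion. Which copy to keep in each bucket is a separate decision that this sketch does not make for you.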


Alex · Accepted Answer · 2018-02-28 16:08:42Z

0

When you index a document with an explicit ID, Elasticsearch looks up that ID in the index. If a document with that ID already exists, it is replaced rather than added as a duplicate, and its version number is incremented to track how many times it has been changed. You therefore need to keep your document IDs stable, so that the same logical document always gets the same ID, to rule out duplicates.
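The behaviour described here can be imitated with a toy in-memory model. This is not Elasticsearch code: `ToyIndex` is a made-up class that only mimics the index-by-ID semantics, and its return value is loosely modelled on the `_version` and `result` fields that newer Elasticsearch versions report.

```python
class ToyIndex:
    """In-memory stand-in that mimics index-by-id semantics; not Elasticsearch."""

    def __init__(self):
        self._docs = {}  # maps _id -> (version, source)

    def index(self, doc_id, source):
        """Store `source` under `doc_id`, replacing any previous document."""
        previous_version = self._docs[doc_id][0] if doc_id in self._docs else 0
        version = previous_version + 1
        self._docs[doc_id] = (version, dict(source))
        return {
            "_id": doc_id,
            "_version": version,
            "result": "created" if version == 1 else "updated",
        }

    def count(self):
        """Number of distinct documents currently stored."""
        return len(self._docs)


idx = ToyIndex()
idx.index("item-1", {"name": "Widget"})
idx.index("item-2", {"name": "Gadget"})
again = idx.index("item-1", {"name": "Widget"})  # same id as the first call
print(idx.count(), again["_version"], again["result"])  # prints: 2 2 updated
```

Posting the same ID twice leaves the document count unchanged and only bumps the version, which is why stable IDs are enough to avoid duplicates.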
