3

I parse documents from a JSON file and add them as children of a parent document. I just post the items to the index without specifying an id.

Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.

I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.

I could compare each item to the children of my target parent and add a new document only if there is no equal child.

Is there a way to let Elasticsearch handle duplicates?

2
  • If the IDs are different every time, then it's not possible. Elasticsearch doesn't handle duplicates. Commented Nov 6, 2015 at 20:36
  • You could either use a primary key from the database or a hashing mechanism to generate a unique id for a given document. If you post documents without specifying _id, ES will generate a unique _id for each document regardless of its content. Commented Nov 6, 2015 at 21:53

2 Answers

7

Duplicates have to be handled through the document ID. Choose a key that is unique for each document and use it as the _id. If the key is too large, or consists of multiple fields, compute a SHA checksum of it and use that as the _id.
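As a rough sketch of the checksum idea, assuming the documents are plain dicts parsed from the JSON: the helper name `make_doc_id` and the field names `title` and `url` are made up for illustration, and SHA-256 is just one possible choice of SHA variant. The key fields are serialized in a canonical form so that the same logical document always yields the same id.

```python
import hashlib
import json


def make_doc_id(doc, key_fields):
    """Derive a deterministic _id from the fields that identify a document.

    `key_fields` is a hypothetical list of field names that together are
    unique for a document; adjust it to your own data.
    """
    # Keep only the identifying fields, so unrelated changes do not alter the id.
    key = {field: doc[field] for field in key_fields}
    # sort_keys and fixed separators make the serialization independent of key order.
    canonical = json.dumps(key, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same item parsed twice (different key order, extra field) maps to one id.
first_parse = {"title": "Item A", "url": "http://example.com/a"}
second_parse = {"url": "http://example.com/a", "title": "Item A", "views": 10}
other_item = {"title": "Item B", "url": "http://example.com/b"}

id_first = make_doc_id(first_parse, ["title", "url"])
id_second = make_doc_id(second_parse, ["title", "url"])
id_other = make_doc_id(other_item, ["title", "url"])
```

Passing the resulting value as the _id when indexing means that re-parsing the JSON overwrites the existing children instead of adding copies, while new items get new ids.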

If you already have duplicates in the index, you can use a terms aggregation with a nested top_hits aggregation to detect them.
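A search body for that detection could look roughly like the following sketch. The function name `duplicate_query`, the aggregation names, and the field `content_hash` are assumptions for illustration; the field has to be one that holds the same exact (non-analyzed) value for documents you consider duplicates.

```python
import json


def duplicate_query(field, max_groups=100, sample_size=5):
    """Build a search body that groups documents by `field` and keeps
    only groups containing at least two documents."""
    return {
        "size": 0,  # no plain hits, only the aggregation result
        "aggs": {
            "duplicates": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,  # skip values that occur only once
                    "size": max_groups,
                },
                "aggs": {
                    # Return a few documents from each duplicate group.
                    "sample_docs": {"top_hits": {"size": sample_size}}
                },
            }
        },
    }


# `content_hash` is a hypothetical field name.
body = duplicate_query("content_hash")
body_json = json.dumps(body, indent=2)
```

Each bucket in the response then corresponds to one duplicated value, with the matching documents listed under the nested top_hits result.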

You can read more about this approach here.


0

When you index a document, Elasticsearch checks whether a document with the same ID already exists. If it does, the existing document is replaced instead of a duplicate being added, and its _version is incremented to track the number of writes that have occurred. You therefore need to keep track of your document IDs and reuse the same ID for matching documents to rule out duplicates.
