
Please note that removing or preventing duplicates is not an option. I have asked if we can do this, and the answer is emphatically no; we have to work around the fact that there are tons of duplicates. Please do not recommend solutions that require removing or updating any documents, as that approach has been rejected by management. I am specifically not "allowed" to implement anything that prevents the duplicates in the first place; I have to live with them.

Please go easy on me: I had never even heard of ElasticSearch before this, and I have done a lot of Googling, but nothing seems to do what I want.

I have an ES index with tons and tons of exact duplicates. The duplicate documents are all exactly the same, down to the millisecond on the timestamp; they are identical.

Like this (you can assume author and title are both keywords and Timestamp is a string):

{ "author" : "Kafka, Franz", "title": "The Trial", "id": "1", "Timestamp" : "12-22-05T01:01:05.0000Z" }

{ "author" : "Kafka, Franz", "title": "The Trial", "id": "1", "Timestamp" : "12-22-05T01:01:05:0000Z" } 

{ "author" : "Kafka, Franz", "title": "The Trial", "id": "1", "Timestamp" : "12-22-05T01:01:05:0000Z" }

... with 100 rows exactly identical to this. And some rows with the same content but different timestamps:

{ "author" : "Kafka, Franz", "title": "The Trial", "id": "1", "Timestamp" : "12-23-05T10:10:0005Z }

..and also some rows which have the same content and timestamp but some other field, like ID for example, is different:

{ "author" : "Kafka, Franz", "title": "The Trial", "id": "2", "Timestamp" : "12-22-05T01:01:05.0000Z" }

I need to query these documents such that the result is all of the documents which match my query AND are unique, i.e., there are no EXACT duplicates in the result. So an expected result with the records above would have just three hits, like this:

{ "author" : "Kafka, Franz", "title": "The Trial", "Timestamp" : "12-22-05T01:01:05:0000Z" } 

{ "author" : "Kafka, Franz", "title": "The Trial", "Timestamp" : "12-23-05T10:10:0005Z }

{ "author" : "Kafka, Franz", "title": "The Trial", "id": "2", "Timestamp" : "12-22-05T01:01:05.0000Z" }

The result would return all of the documents which have the author "Kafka, Franz" and the title "The Trial", but only the unique ones; it would exclude all of the exact duplicates that are EXACTLY the same. Note also that it would return the ENTIRE document, not just the fields I have aggregated.

In SQL this would look something like:

SELECT DISTINCT * FROM table WHERE author='Kafka, Franz' AND title='The Trial';
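
From my Googling, I think the WHERE part of that translates into something like the query below (the books index name is made up; author and title are keyword fields as above). This returns everything that matches the filter, but it still includes every exact duplicate:

POST books/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "author": "Kafka, Franz" } },
        { "term": { "title": "The Trial" } }
      ]
    }
  }
}

What I cannot figure out is the DISTINCT part on top of that.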

Things I have tried:

  • Aggs return the count, but I want the values themselves. E.g., if I use an aggregation it tells me how many results match, but I want it to return every unique document that matches some field. This is like SELECT COUNT(DISTINCT *).

  • Other solutions I have seen show the values, but only the values of the aggregated fields. This is like SELECT DISTINCT author, title FROM table... I want to return the entire document. Like this answer: ElasticSearch - Return Unique Values. (See the sketch after this list for roughly what I have been trying.)

  • I've also seen results where the "WHERE" part is missing, i.e., it is like SELECT DISTINCT * FROM table; whereas I also want a filter on the results, only those which match author and title, e.g., WHERE author='Kafka, Franz' AND title='The Trial';

  • Note that there may be hundreds (or thousands) of exact duplicates and I have to live with this; I cannot remove the duplicates. And the query needs to be very efficient. Is this even a reasonable request for ElasticSearch? I didn't know anything at all about ElasticSearch before yesterday.
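
For concreteness, this is roughly the kind of aggregation I have been experimenting with, adapted from answers like the one linked above. It is only a sketch: the books index name is made up, it assumes Timestamp is mapped as a keyword (or at least has doc values), and the script-based terms aggregation seems like it would be very slow with hundreds of thousands of documents. It does hand back one full document per unique combination via top_hits, but the actual hits list is empty and I don't know if this is the right way to do it:

POST books/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "author": "Kafka, Franz" } },
        { "term": { "title": "The Trial" } }
      ]
    }
  },
  "aggs": {
    "unique_docs": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc['author'].value + '|' + doc['title'].value + '|' + doc['id'].value + '|' + doc['Timestamp'].value"
        },
        "size": 10000
      },
      "aggs": {
        "one_copy": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}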

  • You have done reasonable research on the topic. Unfortunately, I have no idea how you could do that in a search. Maybe ask management and make a secondary database that is created from the original and has no duplicates. Sorry, I have no better idea. :/ Commented Jan 19, 2018 at 9:03
  • Thanks, @MrSimple. I am new to ES, but I think you are right: there is no reasonable way to do this efficiently without going to the root of the problem and dealing with the duplication itself. (I have sketched your secondary-index idea below for reference.) Commented Jan 19, 2018 at 19:14
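
Follow-up, in case it helps anyone who lands here later: as I understand @MrSimple's suggestion, a secondary, deduplicated index could be built from the original without touching the existing documents, for example with a _reindex whose script derives the destination _id from the duplicated fields, so that exact duplicates collapse onto a single document. This is only a sketch (the books and books_dedup index names are made up), and it is not something I am able to use given the constraints above:

POST _reindex
{
  "source": { "index": "books" },
  "dest": { "index": "books_dedup" },
  "script": {
    "lang": "painless",
    "source": "ctx._id = ctx._source.author + '|' + ctx._source.title + '|' + ctx._source.id + '|' + ctx._source.Timestamp"
  }
}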
