I fetch data daily from a source that gives me the same record two or three times. I want to ensure unique records in my database without using indexes.
1 Answer
If the records are in a canonical format and are not too large, you could take a hash of each record and store the hash as the _id key of the document during insertion (the _id field is automatically indexed, and you can't turn that index off).
In Python (you need the pymongo driver installed: pip install pymongo):
>>> import json
>>> import pymongo
>>> client = pymongo.MongoClient()
>>> db = client['test']
>>> collection = db['test']
>>> a = {"a": "b"}
>>> s = json.dumps(a)
>>> hash(s)
7926683998783294113
>>> collection.insert_one({"_id" : hash(s), "data" : a})
<pymongo.results.InsertOneResult object at 0x104ff7bc8>
>>> collection.insert_one({"_id" : hash(s), "data" : a})
...
raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: test.test index: _id_ dup key: { : 7926683998783294113 }
>>>
So to detect your duplicate records, just catch DuplicateKeyError. This approach presumes that the json.dumps output is identical for identical records, which requires consistent key ordering (pass sort_keys=True). Note also that Python's built-in hash() is salted per interpreter process for strings, so the same record will produce a different _id on tomorrow's run; a deterministic digest such as hashlib.sha256 is safer here.
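A minimal sketch of that deterministic variant, assuming your daily batch arrives as a list of dicts (fetched_records below is a placeholder for your own fetch):

import hashlib
import json

import pymongo
from pymongo.errors import DuplicateKeyError

client = pymongo.MongoClient()
collection = client['test']['test']

def record_id(record):
    # Serialize with sorted keys so logically identical records always
    # produce the same byte string, then hash it with SHA-256, which is
    # stable across processes (unlike the built-in hash()).
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

fetched_records = [{"a": "b"}, {"a": "b"}, {"c": "d"}]  # placeholder batch

for record in fetched_records:
    try:
        collection.insert_one({"_id": record_id(record), "data": record})
    except DuplicateKeyError:
        pass  # record already stored on an earlier run; skip the duplicate

The hexadecimal digest works fine as an _id, since MongoDB accepts any BSON-serializable value for that field.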