I have a huge MongoDB database containing documents that are indexed on a name field.
I started from a text file containing 48 000 016 entries (I used wc -l to obtain that count).
To give more context, the database contains a lot of names that were extracted from OCR (so a lot of junk) and also names in other languages (Japanese, Russian, etc.).
My MongoDB collection statistics tell me I have 48 000 016 documents, which is fine.
The problem happens when I query the items on their names (which are standard strings) using this regex:
/^([A-Z]|\W|\s|\d|_)/i
So my checklist:
- any letter - check
- case insensitive - check
- any number - check
- underscore - check
- anything that is not a number, letter or underscore (\W) - check
So from what I understand, this regex should match everything, since I am querying the database on string values. The problem is that I am missing 5 items.
When I run a count on the result of the query, I get 48 000 011 items.
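As a sanity check on the checklist, here is a minimal sketch that runs the same pattern against a few sample strings in plain JavaScript. The sample values are made up for illustration, and JavaScript's regex engine is not the PCRE engine MongoDB uses, so this only hints at which kinds of strings the pattern can fail on.

```javascript
// Same pattern as in the query above.
const pattern = /^([A-Z]|\W|\s|\d|_)/i;

// Hypothetical sample names, not taken from the real data set.
const samples = [
  "Smith",
  "smith",
  "42nd Street",
  "_tmp",
  " leading space",
  "日本語",
  "Иванов",
  "\nnewline first",
  "",
];

// Keep only the samples the pattern does NOT match.
const unmatched = samples.filter((s) => !pattern.test(s));

console.log(JSON.stringify(unmatched)); // prints [""]
```

In this sketch only the empty string falls through, because every alternative in the group needs at least one character to match.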
Any idea where these 5 could be? I know I could simply go through all my items with a plain cursor, but because of the nature of my problem I need a regex that can retrieve all my values.
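One way to look for the missing documents directly would be to query for the complement of the regex. The sketch below only builds the filter object. `buildComplementFilter` is a hypothetical helper, and the field name `name` and the collection placeholder in the comment are assumptions about the schema.

```javascript
// Hypothetical helper: builds a filter selecting documents whose field
// does NOT match the given regex. With $not, MongoDB should also return
// documents where the field is missing or holds a non-string value.
function buildComplementFilter(field, regex) {
  return { [field]: { $not: regex } };
}

// "name" is an assumed field name.
const filter = buildComplementFilter("name", /^([A-Z]|\W|\s|\d|_)/i);

// In the mongo shell this would be used roughly as:
//   db.<collection>.find(filter)
console.log(Object.keys(filter), Object.keys(filter.name));
```

If that query returns exactly 5 documents, inspecting their name values should show what the regex is not covering.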
I ran this query on the database, as suggested in the comments:
db.name.aggregate({$group:{_id:"uniqueDocs", count:{$sum:1}}})
The result is:
{ "result" : [ ], "ok" : 1 }
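For context, the pipeline above groups every document under one constant key and adds 1 per document, so it amounts to a count of the collection. A rough plain-JavaScript equivalent shows why an empty input gives an empty result. The function name `groupCount` is hypothetical, and this only mimics the semantics on an in-memory array rather than calling MongoDB.

```javascript
// Mimics {$group: {_id: "uniqueDocs", count: {$sum: 1}}} on an in-memory array.
function groupCount(docs) {
  const groups = new Map();
  for (const _doc of docs) {
    const key = "uniqueDocs"; // constant _id: every document lands in one group
    groups.set(key, (groups.get(key) || 0) + 1);
  }
  // One output document per group; no input documents means no groups.
  return [...groups].map(([_id, count]) => ({ _id, count }));
}

console.log(JSON.stringify(groupCount([{ name: "a" }, { name: "b" }]))); // [{"_id":"uniqueDocs","count":2}]
console.log(JSON.stringify(groupCount([]))); // []
```

An empty result like the one above would therefore be consistent with the pipeline seeing no documents at all, for example if `name` is not the collection that holds the data. That is a guess, not something the output confirms.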
Thanks a lot!
Comments:
db.<insertYourCollectionNameHere>.aggregate({$group:{_id:"uniqueDocs",count:{$sum:1}}}) and add this to your question by editing it.
Add \n\r to your regex, see my updated answer.