I have a huge MongoDB collection whose documents are indexed on a name field.

The data comes from a text file containing 48 000 016 entries (I used wc -l to obtain that count).

To give more context, the database contains a lot of names that were extracted by OCR (so a lot of junk) and also names in other languages (Japanese, Russian, etc.).

My MongoDB collection statistics report 48 000 016 documents, which matches.

The problem appears when I query the items by name (a standard string field) using this regex:

 /^([A-Z]|\W|\s|\d|_)/i

So my checklist:

  • any letter - check
  • case insensitive - check
  • any number - check
  • underscore - check
  • \W for anything that is not a number, letter or underscore.

So from what I understand, this regex should match everything, since I am querying string values. But the query is missing 5 items.
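To sanity-check this reasoning outside the database, here is a minimal sketch in plain JavaScript (Node.js). It assumes JavaScript's regex engine behaves like MongoDB's for these simple character classes, which may not hold for every edge case. The helper name `firstCharMatches` and the sample strings are made up for illustration.

```javascript
// The regex from the question, unchanged.
const nameRegex = /^([A-Z]|\W|\s|\d|_)/i;

// Hypothetical helper: true when the regex accepts the given name.
function firstCharMatches(name) {
  return nameRegex.test(name);
}

// Made-up sample names: Latin, digits, underscore, punctuation,
// leading whitespace, Japanese, Russian, and finally an empty string.
const sampleNames = [
  "Alice", "bob", "42nd", "_tmp", "#junk",
  " lead", "\nline", "山田", "Иван", "",
];

// Collect the names the regex does NOT accept.
const rejectedNames = sampleNames.filter((name) => !firstCharMatches(name));
console.log(rejectedNames);
```

In this sketch only the empty string is rejected, because every non-empty string starts with either a word character or a `\W` character. A document missed by the query would therefore need an empty or non-string name, or be invisible to the query for a reason unrelated to the pattern.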

When I run the count on the result of the query, I have 48 000 011 items.

Any idea where these 5 could be? I know I could simply go through all my items with a plain cursor, but because of the nature of my problem I need a regex that retrieves all my values.

I ran this query on the database, as suggested in the comments:

db.name.aggregate({$group:{_id:"uniqueDocs", count:{$sum:1}}})

The result is:

{ "result" : [ ], "ok" : 1 }

Thanks a lot !

  • How about inverting the regex and checking the results? Commented Jul 25, 2016 at 4:45
  • Please run db.<insertYourCollectionNameHere>.aggregate({$group:{_id:"uniqueDocs",count:{$sum:1}}}) and add the output to your question by editing it. Commented Jul 25, 2016 at 5:03
  • Try including \n and \r in your regex, see my updated answer. Commented Jul 25, 2016 at 13:12

2 Answers

I see you are using the anchor ^ to match the beginning of a line. It is possible that a line starts with a newline character \n or a carriage return \r.

Try including \n and \r in your regex:

/^([A-Z]|\W|\s|\d|\r|\n|_)/i

Also try removing the anchor:

/([A-Z]|\W|\s|\d|\r|\n|_)/i

As a last option, invert your regex to see which records are not included. The following expression should also match empty strings:

/^(?![.*])/i
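
The patterns in this answer can be compared in plain JavaScript (Node.js) before running them against the database. This is only a sketch: it assumes JavaScript's regex semantics match MongoDB's for these patterns, and the names `variants`, `strictInverse`, `matchTable`, and the sample strings are illustrative. One caveat worth checking: `[.*]` is a character class matching a literal `.` or `*`, so `/^(?![.*])/` accepts any string that does not start with one of those two characters. A strict complement of the original pattern would instead wrap the original alternation in the negative lookahead, which is what `strictInverse` does here.

```javascript
const variants = {
  original: /^([A-Z]|\W|\s|\d|_)/i,
  withNewlines: /^([A-Z]|\W|\s|\d|\r|\n|_)/i,
  unanchored: /([A-Z]|\W|\s|\d|\r|\n|_)/i,
  lookahead: /^(?![.*])/i,
  // Assumption: a strict complement of `original`, not taken from the answer.
  strictInverse: /^(?!([A-Z]|\W|\s|\d|_))/i,
};

// Made-up inputs: plain name, leading CR/LF, leading dot, empty string.
const samples = ["Alice", "\r\nAlice", ".hidden", ""];

// Build { label: [true/false per sample] } for every pattern.
function matchTable(patterns, inputs) {
  const table = {};
  for (const [label, pattern] of Object.entries(patterns)) {
    table[label] = inputs.map((input) => pattern.test(input));
  }
  return table;
}

const table = matchTable(variants, samples);
console.log(table);
```

In this sketch the first three patterns agree on every sample, since `\W` and `\s` already cover `\r` and `\n`, and only the strict complement isolates the empty string.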

3 Comments

I forgot to mention in the question that the regex I showed is the compact version. In reality I am using something like A|B|C|D..., which I generate from a Python array because I want to process the database in multiple concurrent processes. I double-checked and there is no error in these; they give me the same result as the posted regex. So the problem lies elsewhere.
I tried both commands and still get the same results. From what I can read, \W should match everything else, but it does not. So my guess is that I have empty strings or special characters that cannot be processed by the regex. Is that possible?
@ElCapitaine, empty strings could be a good explanation. Try the inverted regex ^(?![.*]) and take a look at the results. This should also find empty strings.

I want to thank @Paul Wasilewski for giving me some great suggestions. I found my problem, which was not related to the regex.

My 5 entries were simply not indexed: each name was more than 1024 bytes long, so MongoDB could not index it.

That is why they could not be found by the regex query.
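
One way to spot such entries ahead of time is to measure each name's encoded size. The sketch below, in plain JavaScript for Node.js, flags names whose UTF-8 encoding exceeds 1024 bytes. The helper name `exceedsIndexKeyLimit` and the sample values are hypothetical, and the check is approximate: it measures only the string itself, while the limit MongoDB enforces may also count per-entry overhead.

```javascript
// Limit quoted in this answer, in bytes.
const INDEX_KEY_LIMIT_BYTES = 1024;

// Hypothetical helper: true when the UTF-8 encoding of `name`
// is larger than the limit. Bytes, not characters, are counted.
function exceedsIndexKeyLimit(name, limit = INDEX_KEY_LIMIT_BYTES) {
  return Buffer.byteLength(name, "utf8") > limit;
}

// Made-up candidates: a short name, 1025 ASCII characters, and
// 342 Japanese characters (3 bytes each in UTF-8, 1026 bytes total).
const candidates = ["short name", "a".repeat(1025), "あ".repeat(342)];

const oversized = candidates.filter((name) => exceedsIndexKeyLimit(name));
console.log(oversized.length);
```

Counting bytes rather than characters matters for the non-Latin names mentioned in the question, since a name of a few hundred Japanese or Russian characters can already exceed the limit.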
