Possible to retrieve multiple random, non-sequential documents from MongoDB?

Question

I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.

I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.

Instead, I'd like to retrieve a random set like MySQL does using one query (or a minimal amount of queries), and I need this list to be random every time. I need this to be efficient -- relatively on par with such a query with MySQL. I want to reproduce the following but in MongoDB:

SELECT * FROM products ORDER BY rand() LIMIT 50;

Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.

I've seen one method of adding a field to each document, generating a random value for each field, and using {rand: {$gte:rand()}} in each query we want randomized. But my concern is that two queries could theoretically return the same set.

If you can "retrieve one random document" then you can retrieve multiple by repeating, no?
– Mitch Wheat, Commented Dec 7, 2014 at 22:53
I think that would be inefficient -- I need this to be on par with a MySQL rand() sorted query.
– Chad Johnson, Commented Dec 7, 2014 at 22:56

dotpush · Accepted Answer · 2014-12-08 02:03:42Z

2

You can do this with two requests, and still do it efficiently:

1. Your first request gets the list of the "_id" values of all documents in your collection. Be sure to use a projection: db.products.find({}, { '_id' : 1 }).
2. You now have a list of "_id" values; pick N of them at random.
3. Do a second query using the $in operator.

What is especially important is that your first query is fully supported by an index (because it's "_id"). This index is likely fully in memory (else you'd probably have performance problems). So, only the index is read while running the first query, and it's incredibly fast.

Although the second query means reading actual documents, the index will help a lot.

If you can do things this way, you should try.
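
The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: it assumes `collection` is a Node.js MongoDB driver Collection (for example `db.collection('products')`), and the helper names `pickRandom` and `findRandomDocuments` are made up for this example.

```javascript
// Pick n items uniformly at random, without replacement, using a partial
// Fisher-Yates shuffle on a copy of the input array.
function pickRandom(items, n) {
  const pool = items.slice();
  const count = Math.max(0, Math.min(n, pool.length));
  for (let i = 0; i < count; i++) {
    // Swap position i with a random position in [i, pool.length).
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Assumption: `collection` is a Node.js MongoDB driver Collection. It is a
// parameter here so the sketch does not depend on a live connection.
async function findRandomDocuments(collection, n) {
  // Query 1: fetch only the _id of every document.
  const idDocs = await collection
    .find({}, { projection: { _id: 1 } })
    .toArray();
  const ids = pickRandom(idDocs.map((doc) => doc._id), n);

  // Query 2: fetch the full documents for the chosen ids.
  // $in does not return documents in the order of `ids`, so shuffle the
  // result if the order of the returned documents should also be random.
  const docs = await collection.find({ _id: { $in: ids } }).toArray();
  return pickRandom(docs, docs.length);
}
```

With Mongoose, something like `Product.find({}, '_id').lean()` would play the role of the first query. Note that the whole list of ids is held in application memory, which is the cost raised in the comments below.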

answered Dec 8, 2014 at 2:03

dotpush


7 Comments

Chad Johnson Over a year ago

If I have 500,000 documents in my collection, would this still be efficient?

Sammaye Over a year ago

@ChadJohnson nah, not even close, you would need a separate field: stackoverflow.com/questions/2824157/random-record-from-mongodb try looking at anything but the first answer there

dotpush Over a year ago

@Chad Johnson: The best way to know is probably to try it on your collection. For the first request, to achieve your goal (really random documents), you should not use a limit. However, if you just want to check that the first request is not too intensive for your production system, you may try it with a limit of 1000, then 5000, 25000, and so on, until you reach the number of documents in your collection and have confirmed everything is correct.

dotpush Over a year ago

@Sammaye: Could you link to the specific answer that, in your view, does the best job of picking N (for example 50) random documents?

Sammaye Over a year ago

stackoverflow.com/a/5517206/383478 would work quite well with some changes for this specific scenario, straight indexed query pulling out only what is needed

wdberkeley · Accepted Answer · 2014-12-10 16:09:12Z

0

I don't think MySQL ORDER BY rand() is particularly efficient - as I understand it, it essentially assigns a random number to each row, then sorts the table on this random number column and returns the top N results.

If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range. Add a counter field to each document: each document will be assigned a unique positive integer, sequentially. It doesn't matter what document gets what number, as long as the assignment is unique and the numbers are sequential, and you either don't delete documents or you complicate the counter document scheme to handle holes.

You can do this by making your inserts two-step. In a separate counter collection, keep a document with the first number that hasn't been used for the counter. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment the counter value atomically. Then insert the new document with the counter value.

To find N random values, find the max counter value, then generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages should have random libraries that will handle generating the N random integers in a range.
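
The read side of this scheme might look something like the sketch below. It is an illustration under assumptions, not a definitive implementation: the field name `counter`, the helper names, and the use of a Node.js MongoDB driver Collection are all hypothetical, and the counters are assumed to run from 1 to the maximum with no holes. Here the maximum is read from the collection itself by sorting on `counter`; reading it from the separate counter document would work as well.

```javascript
// Generate up to n distinct random integers in the range [1, max].
// Simple rejection approach: fine when n is much smaller than max,
// slower when n approaches max.
function distinctRandomInts(n, max) {
  const count = Math.min(n, max);
  const chosen = new Set();
  while (chosen.size < count) {
    chosen.add(1 + Math.floor(Math.random() * max));
  }
  return [...chosen];
}

// Assumption: `collection` is a Node.js MongoDB driver Collection whose
// documents each carry a unique integer `counter` field (hypothetical name).
async function findRandomByCounter(collection, n) {
  // Find the highest counter in use; an index on `counter` keeps this cheap.
  const [last] = await collection
    .find({}, { projection: { counter: 1 } })
    .sort({ counter: -1 })
    .limit(1)
    .toArray();
  if (!last) return []; // empty collection

  // Draw the random counters, then fetch the matching documents in one query.
  const wanted = distinctRandomInts(n, last.counter);
  return collection.find({ counter: { $in: wanted } }).toArray();
}
```

If documents have been deleted, some of the drawn counters will match nothing and fewer than N documents come back, which is the limitation discussed in the comments.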

edited Dec 10, 2014 at 16:09

answered Dec 8, 2014 at 17:20

wdberkeley


3 Comments

dotpush Over a year ago

"as long as the assignment is unique and the numbers are sequential" -> I'd add also as long as documents are never deleted.

Sammaye Over a year ago

Doesn't rand() actually pick from an AI (auto-increment) key on the table, if I remember correctly?

wdberkeley Over a year ago

@dotpush - excellent point. It does require docs aren't deleted. I've edited the answer. You could make the numbering scheme more complicated to allow deletions. I think it might be easier just to do single random draws than have to structure the use of the collection around drawing samples, for many use cases.
