Currently I have a mongo document that looks like this:

{'_id': id, 'title': title, 'date': date}

I'm trying to search this collection by title. The database holds only about 5k items, which is not much, but my file has 1 million titles to search for.

I have created an index on title in the collection, but performance is still quite slow (about 40 seconds per 1000 titles, which is unsurprising since I'm issuing one query per title). Here is my code so far:

Work repository creation:

class WorkRepository(GenericRepository, Repository):
    def __init__(self, url_root):
        super(WorkRepository, self).__init__(url_root, 'works')
        self._db[self.collection].ensure_index('title')

The entry point of the program (it is a REST API):

start = time.clock()
for work in json_works: #1000 titles per request
    result = work_repository.find_works_by_title(work['title'])

    if result:
        works[work['id']] = result

end = time.clock()
print end-start

return json_encoder(request, works)

And the find_works_by_title code:

def find_works_by_title(self, work_title):
    works = list(self._db[self.collection].find({'title': work_title}))

    return works

I'm new to Mongo and have probably made some mistake. Any recommendations?

1 Answer

You're making one call to the DB for each of your titles. The roundtrip is going to significantly slow the process down (the program and the DB will spend most of their time doing network communications instead of actually working).

Try the following (adapt it to your program's structure, of course):

# Build a list of the 1000 titles you're searching for.
titles = [w["title"] for w in json_works]

# Make exactly one call to the DB, asking for all of the matching documents.
return collection.find({"title": {"$in": titles}})

Further reference on how the $in operator works: http://docs.mongodb.org/manual/reference/operator/query/in/
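
The loop in the question keys its results by work['id'], so the documents returned by a single $in query have to be regrouped by title to rebuild that mapping. Below is a minimal sketch of one way to do it. The function name find_works_by_titles is hypothetical, and `collection` is assumed to be a pymongo Collection (or anything exposing a compatible find()):

```python
from collections import defaultdict


def find_works_by_titles(collection, json_works):
    """Look up a whole batch of titles with a single $in query.

    Assumptions: `collection` behaves like a pymongo Collection, and
    `json_works` is a list of {'id': ..., 'title': ...} dicts, as in
    the question's request payload.
    """
    # De-duplicate so each title is sent to the server only once.
    titles = list({w["title"] for w in json_works})

    # One round trip: fetch every document whose title is in the batch,
    # then group the documents by title on the client side.
    docs_by_title = defaultdict(list)
    for doc in collection.find({"title": {"$in": titles}}):
        docs_by_title[doc["title"]].append(doc)

    # Rebuild the {work id: [matching documents]} mapping, skipping
    # works whose title matched nothing.
    return {
        w["id"]: docs_by_title[w["title"]]
        for w in json_works
        if w["title"] in docs_by_title
    }
```

In the question's handler this would replace the whole for loop, for example works = find_works_by_titles(self._db[self.collection], json_works).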

If after that your queries are still slow, use explain on the find call's return value (more info here: http://docs.mongodb.org/manual/reference/method/cursor.explain/) and check that the query is, in fact, using an index. If it isn't, find out why.
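
As a rough sketch of that check: the helper below walks an explain() result and reports whether the winning plan contains an index scan. It assumes the "queryPlanner" output format used by MongoDB 3.0 and later, where plan stages are named IXSCAN (index scan) and COLLSCAN (full collection scan); older servers report a "cursor" field instead, which this sketch does not handle. The helper names are hypothetical.

```python
def iter_stages(node):
    """Yield every "stage" name found anywhere in a nested plan document."""
    if isinstance(node, dict):
        if "stage" in node:
            yield node["stage"]
        for value in node.values():
            yield from iter_stages(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_stages(item)


def uses_index(explain_output):
    """Return True if the winning plan scans an index and never the whole collection.

    Assumption: `explain_output` is the dict returned by cursor.explain()
    in the MongoDB 3.0+ "queryPlanner" format.
    """
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    stages = set(iter_stages(winning))
    return "IXSCAN" in stages and "COLLSCAN" not in stages
```

With pymongo this could be called as uses_index(collection.find({"title": {"$in": titles}}).explain()).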

3 Comments

Will try, but looks exactly what I was looking for :)
With the $in operator the performance improves, of course. However, I'm not getting any matches (I'm adding a known value to every batch of titles to make sure something matches), and I've had no luck so far.
OK, my bad: I used nested brackets ([[) when generating the list. It works perfectly now.
