3

Mongo DB Version 3.4.6

I have a collection with a document structure which resembles the following:

{
  organization: "ABC123",
  tags: ["MARTHA WASHINGTON", "+15552082000"],
  updatedAt : ISODate("2020-10-09T17:19:44.861Z"),
  createdAt : ISODate("2020-01-14T19:46:15.957Z"),
}

I need to be able to query by organization and a regex "starts with" on the tags array, and optionally sort by updatedAt or createdAt. To accomplish this, I created the following index:

{
    "organization" : 1,
    "tags" : 1,
    "createdAt" : -1
}

This is a multikey compound index which based on my understanding of Mongo should allow me to cover the query in all cases. If I execute a query like:

db.getCollection('data').find({"organization": "ABC123", "search": /^MARTHA WASHINGTO/})

The query is covered by the index - I see a single FETCH/IXSCAN stage.

Likewise, if I remove the regex query and add a sort - the query is perfectly covered.

db.getCollection('data').find({"organization": "ABC123", "search": "MARTHA WASHINGTON"}).sort({"createdAt":-1})

However, if I combine the regex and sort options, suddenly I see an extra SORT stage in my query. Example query:

db.getCollection('data').find({"organization": "ABC123", "search": /^MARTHA WASHINGTO/}).sort({"createdAt":-1})

Here is the winning plan output from the explain:

"winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "createdAt" : -1.0
            },
            "inputStage" : {
                "stage" : "SORT_KEY_GENERATOR",
                "inputStage" : {
                    "stage" : "FETCH",
                    "inputStage" : {
                        "stage" : "IXSCAN",
                        "keyPattern" : {
                            "organization" : 1,
                            "tags" : 1,
                            "createdAt" : -1
                        },
                        "indexName" : "tag matches by organization",
                        "isMultiKey" : true,
                        "multiKeyPaths" : {
                            "organization" : [],
                            "search" : [ 
                                "search"
                            ],
                            "createdAt" : []
                        },
                        "isUnique" : false,
                        "isSparse" : false,
                        "isPartial" : false,
                        "indexVersion" : 2,
                        "direction" : "forward",
                        "indexBounds" : {
                            "organization" : [ 
                                "[\"ABC123\", \"ABC123\"]"
                            ],
                            "tags" : [ 
                                "[\"MARTHA WASHINGTON\", \"MARTHA WASHINGTOO\")", 
                                "[/^MARTHA WASHINGTON/, /^MARTHA WASHINGTON/]"
                            ],
                            "createdAt" : [ 
                                "[MaxKey, MinKey]"
                            ]
                        }
                    }
                }
            }
        },

I am stumped about why this combination of queries is not being covered by the index. My understanding is that the extra sort stage at the beginning will result in slow performance for large collections. Can anyone provide some guidance? Is there some limitation that I've missed?

Update: winning plan when the regex query is removed

   "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "organization" : 1,
                    "search" : 1,
                    "createdAt" : -1
                },
                "indexName" : "tag matches by organization",
                "isMultiKey" : true,
                "multiKeyPaths" : {
                    "organization" : [],
                    "search" : [ 
                        "search"
                    ],
                    "createdAt" : []
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2,
                "direction" : "forward",
                "indexBounds" : {
                    "organization" : [ 
                        "[\"ABC123\", \"ABC123\"]"
                    ],
                    "tags" : [ 
                        "[\"MARTHA WASHINGTON\", \"MARTHA WASHINGTON\"]"
                    ],
                    "createdAt" : [ 
                        "[MaxKey, MinKey]"
                    ]
                }
            }
        },
4
  • What is the query plan for the covered query? Commented Oct 9, 2020 at 19:04
  • @D.SM - I've added the winning plan per your request Commented Oct 9, 2020 at 20:37
  • Just saw the version number - test on 4.4.1 please. Commented Oct 9, 2020 at 21:56
  • I think the version number won't matter in this case. Commented Oct 9, 2020 at 22:07

2 Answers 2

1

The other answer is not quite accurate. From the docs

For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan.

Mongo is capable of utilizing an index with a regex, obviously if your regex is a suffix regex than a collection scan might actually be faster as Mongo will have to read the entire index tree to suffice it.

So what's happening in your query? why is the winning plan a sort? Well while it might be possible that it's actually the best way to fetch results there's also the possibility that Mongo simply chooses the wrong plan.

First let's understand how does Mongo choose a winning plan, Plan evaluation is based on comparing candidate plans for a given query to see which returns the first batch of results (by default 101 documents) with the least amount of overall "work". The works score is a proxy for different effort involved in query stages (index key comparisons, fetching documents, etc). If multiple plans perform identical work during evaluation, there are some small tie-breaking bonuses that can help choose a plan to cache. Basically Mongo performs a small "race" and waits to who wins.

So in your case due to the regex nature with indexes the sort stage wins, it's possible that if you ran the plans fully instead of a small sample a different plan would had been chosen.

I recommend you do your own tests using hint, this forces Mongo to use a certain index meaning you can force Mongo's winning plan for your query. I personally feel (obviously specific regex dependant) is that you could improve performance by doing it as sorting first is hardly every the "best" plan.

Sign up to request clarification or add additional context in comments.

Comments

0

Suppose you have two fields in the collection: name and age, and you filter by name and order by age.

Suppose you have the following documents:

JON 30
JON 45
JONATHAN 40

Suppose you create an index on (name, age). This index orders the documents as listed above.

If you query for name = JON and order by age, all of the conditions are exactly matched by the index and the output of (JON, 30), (JON, 45) can be obtained from index traversal only.

If you query for name =~ ^JON and order by age, the output you expect is now (JON, 30), (JONATHAN, 40), (JON, 45). This ordering does not exist in the index since the name match is now not exact, hence the server has to sort the result set to provide it.

2 Comments

Thank you, I think this makes sense. How "bad" is it to add the sort in this case as the collection size scales? Since its occurring on an index, is that ok since it is in memory?
The server would only sort whatever documents matched the conditions. If it's a small number it's not too much of an issue.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.