
I have the following class:

public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

I create a List of these, cache it, and then try to search through this collection (1.2M records) with the following:

var found = myList.Where(x => x.Description.StartsWith(query)).Take(10).ToList();

This is pretty slow. What I am going for: is there a better way to store a list of objects and have the ability to search a string property of the object?

Should I be sorting the collection before caching it? I would like to have the ability to do .StartsWith and .Contains on the Description property, with the fastest path to get the top 10 matches.

If I just go to the DB it's faster (I have put an index on the text field). I was hoping to improve performance by getting the results once, sticking them in memory, and then running all searches against the in-memory cache instead of going to the DB each time. But this is proving to be slower than the DB calls using the SQL LIKE '{query}%' statement.

  • Because LIKE 'query%' is sargable and it uses the index to search. You too can create some tree from the data to search, but it will be slow if you only want to do it once. Look up b-trees; that is how the index is stored. Commented Nov 21, 2015 at 5:20
  • The Description property contains a single word or a phrase? How about query? Commented Nov 21, 2015 at 10:27
  • @IgorBendrup they are movie titles; could be a word or a phrase Commented Nov 21, 2015 at 17:21

3 Answers


String comparisons are inherently slow, and additionally you have to iterate the entire list to see if there is a match. This is never going to perform well, and in fact will likely get worse over time as new records are added to the source.

Here is a nice article on string searching for those concerned with speed.

I would suggest doing just as you mentioned: move the searching to the database and limit the number of returned rows. While this is still I/O, databases are optimized for handling this sort of thing. Another advantage is that you avoid the pitfall of your app crashing and losing the cached results; likewise you can leverage async/await, which will make your app more responsive.

If you decide that you still want to go the route of pulling everything into memory and then querying the objects, good luck. The only other suggestion I have would be to consider search-result caching, such that if someone searched for the same thing relatively recently, you could cache those results and return them immediately.
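A minimal sketch of that kind of result caching. SearchCache and its members are hypothetical names, not an existing API; the backing search is injected as a delegate:

```csharp
using System;
using System.Collections.Generic;

// Memoizes results per query string so repeated searches skip the slow path.
// All names here are illustrative, not from any library.
public class SearchCache
{
    private readonly Dictionary<string, List<string>> cache =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    private readonly Func<string, List<string>> search;

    public SearchCache(Func<string, List<string>> search)
    {
        this.search = search;
    }

    public List<string> Find(string query)
    {
        List<string> results;
        if (!cache.TryGetValue(query, out results))
        {
            results = search(query);   // the slow path (DB call or full scan)
            cache[query] = results;    // memoize for repeat queries
        }
        return results;
    }
}
```

A production cache would also bound its size and expire entries (e.g. with MemoryCache); this sketch only shows the memoization shape.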

From the same author, here is another source for reading - here he compares collection string lookup speed.

http://cc.davelozinski.com/c-sharp/fastest-collection-for-string-lookups


4 Comments

It's weird that these links don't even mention Boyer-Moore or Knuth-Morris-Pratt, which are probably much more efficient than all these other methods (presuming you need to do many lookups and can spare a couple of instructions on preprocessing the search string).
Do you have any specific examples to share?
Nothing specific; implementations can be found on Google. Just wanted to say the list is not nearly complete (I don't think the author tried using a compiled regex, for example). And the second link only compares built-in ways of searching for an exact string within a collection (where using a hashset is a no-brainer), which is not applicable to many problems where you need a partial match.
+1, I didn't realize OP needs Contains along with StartsWith, in which case your answer makes much more sense.

First of all, it's an information retrieval task, and it seems like there isn't an efficient way to do it using only LINQ.

What you need is some kind of reverse index (also known as an inverted index). A very simple implementation of a reverse index in your case is a Dictionary<string, List<SearchResult>>. For each word in the movie titles, this Dictionary contains all movies whose title contains that word. To build this simple reverse index you can use the following code:

var reverseIndex = new Dictionary<string, List<SearchResult>>();
foreach (var res in searchResults)
{
    var words = res.Description.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in words)
    {
        List<SearchResult> postings;
        if (!reverseIndex.TryGetValue(word, out postings))
            reverseIndex[word] = postings = new List<SearchResult>();
        if (!postings.Contains(res))
            postings.Add(res);
    }
}

Now, instead of slow:

searchResults.Where(x => x.Description.Contains(query));

you can use simple:

reverseIndex[query];

It works very fast. Instead of

searchResults.Where(x => x.Description.StartsWith(query));

you can use:

reverseIndex[query].Where(s => s.Description.StartsWith(query));

If your query contains more than one word, you can split it into words, extract the List<SearchResult> for each word, and then intersect the lists.
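As a sketch, that multi-word lookup against the reverse index might look like this (FindAll is my name for it; TryGetValue also guards against a KeyNotFoundException for words that are not in the index):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// SearchResult as defined in the question.
public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

static List<SearchResult> FindAll(
    Dictionary<string, List<SearchResult>> reverseIndex, string query)
{
    var words = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    List<SearchResult> result = null;
    foreach (var word in words)
    {
        List<SearchResult> postings;
        if (!reverseIndex.TryGetValue(word, out postings))
            return new List<SearchResult>(); // an unknown word empties the result
        result = result == null
            ? new List<SearchResult>(postings)
            : result.Intersect(postings).ToList(); // keep movies matching every word
    }
    return result ?? new List<SearchResult>();
}
```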

With this simple implementation of a reverse index, your query can contain only whole words. If you want to search by a part of a word, you need to use a permuterm index. One possible implementation in C# can be found here. Note that a permuterm index needs a lot of additional memory.

3 Comments

Why do you think query won't be a single letter? To me this looks like server side autocomplete functionality, and dictionary is not the appropriate data structure for this.
OP says: "I would like to have the ability to do ... .Contains on Description property...". But a trie allows only .StartsWith queries, doesn't it? So I recommended using a permuterm index for wildcard queries.
Anyway, +1 for mentioning permuterm indexes, which would handle both StartsWith and Contains cases well, although it requires a fairly large index tree to store all possible rotations. But permuterm is what answers OP's question, while a plain dictionary approach fails at anything but whole-word lookups, so you might want to rearrange your answer to emphasize that.

Fast string prefix searches are best done using a trie data structure. What's cool about the trie is that all descendants of any given node have a common prefix of the string associated with that node. It can also be compacted into a radix tree, which is perhaps slightly more complex to implement.

Right now you are using Linq-to-objects to iterate through the entire list for every search (and each StartsWith method is O(m) with m being the length of the query string). If you used Linq-to-SQL, it would get translated to an SQL query which would use indexes to perform a more efficient lookup.

This link has an example implementation of the auto-complete functionality using a trie.
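A minimal (uncompacted) trie sketch along those lines; all names here are illustrative and not taken from the linked article:

```csharp
using System;
using System.Collections.Generic;

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public string Title; // non-null when a full title ends at this node
}

public class TitleTrie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string title)
    {
        var node = root;
        foreach (var c in title)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.Title = title;
    }

    // Walk down to the prefix node in O(m), then collect up to `limit`
    // titles from the subtree below it; no scan over the full list.
    public List<string> StartsWith(string prefix, int limit)
    {
        var node = root;
        foreach (var c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return new List<string>(); // no title has this prefix
        var results = new List<string>();
        Collect(node, results, limit);
        return results;
    }

    private static void Collect(TrieNode node, List<string> results, int limit)
    {
        if (results.Count >= limit) return;
        if (node.Title != null) results.Add(node.Title);
        foreach (var child in node.Children.Values)
            Collect(child, results, limit);
    }
}
```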

(Update)

As @David mentioned in the comments, a trie might be overkill if you already have this data loaded in a list (that is, if you need to keep it in this form anyway, which you probably do). In that case, a better alternative for StartsWith queries would be to keep the list sorted. That would allow you to use binary search to get your results in O(m log n).
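A sketch of that sorted-list approach (PrefixSearch is my name for it): binary search locates the first candidate in O(m log n), then a short linear walk collects the matches. The list must be sorted with the same comparer used for the search (ordinal here), or the binary search contract breaks:

```csharp
using System;
using System.Collections.Generic;

static List<string> PrefixSearch(List<string> sorted, string prefix, int limit)
{
    int i = sorted.BinarySearch(prefix, StringComparer.Ordinal);
    if (i < 0) i = ~i; // bitwise complement gives the index of the first element >= prefix
    var results = new List<string>();
    while (i < sorted.Count && results.Count < limit &&
           sorted[i].StartsWith(prefix, StringComparison.Ordinal))
    {
        results.Add(sorted[i]);
        i++;
    }
    return results;
}
```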

Depending on whether the data is changed often, you might also use a balanced binary tree for storing the data, to allow you fast insertion/deletion (which is essentially what SortedDictionary gives you).

But ultimately, if you also need Contains queries, then you will either need to spare much more memory on indexing (as described by @Igor in his answer), or simply let your DBMS do it (as suggested by @David).

On the other hand, you might want to try using SQL Server's full text search. Or, if you are willing to go a bit outside of writing SQL, Lucene.Net with in-memory caching should give you even better performance.

8 Comments

Here is another autocomplete example with a couple of nice pictures and the source code.
The issue with this answer is that the trie approach is really only going to add overhead, i.e. more objects in memory. It was stated that there are 1.2+ million records; as such, pulling all of those into memory plus the extra data structures of the trie will only worsen performance. upcommons.upc.edu/bitstream/handle/2099.1/16165/…
@David: of course it's going to add memory overhead, that's what you are supposed to use your server's memory for. But worsen performance? How exactly? OP's current algorithm has a runtime complexity of O(n*m), where n is the number of records (1.2+ million of them), and m is the length of the query string. He needs to iterate through 1.2 million items for each query string. Now compare this to a trie, which needs at most m comparisons to get to the correct node. Not only will using a trie not worsen performance, it will improve it by several orders of magnitude.
My point is being misconstrued. I challenge you to create a SQL DB with 1.2 million records where there is a string field of varied length. Try performance testing both ways, i.e. let SQL do the lookup vs. the trie data structure. I know which is more performant but would love to be enlightened.
I am not sure what your point is, perhaps I misunderstood, but please clarify it. RDBMS are definitely smart, and with the right indexes can perform incredibly well. They also utilize memory efficiently. But will an in-memory trie return a prefix search faster than a %like% query run against a vanilla SQL Server instance? If implemented reasonably, yes. Just like a .NET in-memory dictionary will yield faster lookups than SQL queries by an indexed column. Just like a simple binary search on a sorted List<T> will run faster than a sargable SQL query.
