
I have the following class:

public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

I create a List of these, cache it, and then try to search through this collection (1.2M records) with the following:

var found = myList.Where(x => x.Description.StartsWith(query)).Take(10).ToList();

This is pretty slow. What I am going for: is there a better way to store a list of objects and have the ability to search a string property of the object?

Should I be sorting the collection before caching it? I would like to have the ability to do .StartsWith and .Contains on the Description property, with the fastest path to get the top 10 matches.

If I just go to the DB it's faster (I have put an index on the text field). I was hoping to improve performance by getting the results once, sticking them in memory, and then running all searches against the in-memory cache instead of going to the DB each time. But this is proving to be slower than the DB calls using the SQL LIKE '{query}%' statement.

  • Because LIKE 'query%' is sargable and it uses the index to search. You too can create some tree from the data to search, but it will be slow if you only want to do it once. Look up b-trees; that is how the index is stored. Commented Nov 21, 2015 at 5:20
  • The Description property contains a single word or a phrase? How about query? Commented Nov 21, 2015 at 10:27
  • @IgorBendrup they are movie titles; could be a word or a phrase Commented Nov 21, 2015 at 17:21

3 Answers


String comparisons are inherently slow, and additionally you have to iterate the entire list to see if there is a match. This is never going to perform well, and in fact will likely get worse over time as new records are added to the source.

Here is a nice article on string searching for those concerned with speed.

I would suggest doing just as you mentioned: move the searching to the database and limit the number of returned rows. While this is still I/O, databases are optimized for handling this sort of thing. Another advantage is that you avoid the pitfall of your app crashing and losing the cached results; likewise you can leverage async/await, which will make your app more responsive.

If you decide that you still want to go the route of pulling everything into memory and then querying the objects, good luck. The only other suggestion I have would be to consider search-result caching, such that if someone searched for the same thing relatively recently, you could cache those results and return them immediately.
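A minimal sketch of that kind of result caching. SearchCache and its members are hypothetical names, not an existing API; the backing search is injected as a delegate:

```csharp
using System;
using System.Collections.Generic;

// Memoizes results per query string so repeated searches skip the slow path.
// All names here are illustrative, not from any library.
public class SearchCache
{
    private readonly Dictionary<string, List<string>> cache =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    private readonly Func<string, List<string>> search;

    public SearchCache(Func<string, List<string>> search)
    {
        this.search = search;
    }

    public List<string> Find(string query)
    {
        List<string> results;
        if (!cache.TryGetValue(query, out results))
        {
            results = search(query);   // the slow path (DB call or full scan)
            cache[query] = results;    // memoize for repeat queries
        }
        return results;
    }
}
```

A production cache would also bound its size and expire entries (e.g. with MemoryCache); this sketch only shows the memoization shape.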

From the same author, here is another source for reading - here he compares collection string lookup speed.

http://cc.davelozinski.com/c-sharp/fastest-collection-for-string-lookups


4 Comments

It's weird that these links don't even mention Boyer-Moore or Knuth-Morris-Pratt, which are probably much more efficient than all these other methods (presuming you need to do many lookups and can spare a couple of instructions on preprocessing the search string).
Do you have any specific examples to share?
Nothing specific; implementations can be found on Google. Just wanted to say the list is not nearly complete (I don't think the author tried using a compiled regex, for example). And the second link only compares built-in ways of searching for an exact string within a collection (where using a hashset is a no-brainer), which is not applicable to many problems where you need a partial match.
+1, I didn't realize OP needs Contains along with StartsWith, in which case your answer makes much more sense.

First of all, it's an information retrieval task, and it seems like there isn't an efficient way to do it using only LINQ.

What you need is some kind of reverse index (also known as an inverted index). A very simple implementation of a reverse index in your case is a Dictionary<string, List<SearchResult>>. For each word in the movie titles, this Dictionary contains all movies whose title contains that word. To build this simple reverse index you can use the following code:

var reverseIndex = new Dictionary<string, List<SearchResult>>();
foreach (var res in searchResults)
{
    var words = res.Description.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var word in words)
    {
        List<SearchResult> postings;
        if (!reverseIndex.TryGetValue(word, out postings))
            reverseIndex[word] = postings = new List<SearchResult>();
        if (!postings.Contains(res))
            postings.Add(res);
    }
}

Now, instead of slow:

searchResults.Where(x => x.Description.Contains(query));

you can use simple:

reverseIndex[query];

It works very fast. Instead of

searchResults.Where(x => x.Description.StartsWith(query));

you can use:

reverseIndex[query].Where(s => s.Description.StartsWith(query));

If your query contains more than one word, you can split it into words, extract the List<SearchResult> for each word, and then intersect the lists.
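As a sketch, that multi-word lookup against the reverse index might look like this (FindAll is my name for it; TryGetValue also guards against a KeyNotFoundException for words that are not in the index):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// SearchResult as defined in the question.
public class SearchResult
{
    public string Description { get; set; }
    public int Year { get; set; }
    public int Type { get; set; }
}

static List<SearchResult> FindAll(
    Dictionary<string, List<SearchResult>> reverseIndex, string query)
{
    var words = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    List<SearchResult> result = null;
    foreach (var word in words)
    {
        List<SearchResult> postings;
        if (!reverseIndex.TryGetValue(word, out postings))
            return new List<SearchResult>(); // an unknown word empties the result
        result = result == null
            ? new List<SearchResult>(postings)
            : result.Intersect(postings).ToList(); // keep movies matching every word
    }
    return result ?? new List<SearchResult>();
}
```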

With this simple implementation of a reverse index, your query can contain only whole words. If you want to search by a part of a word, you need to use a permuterm index. One possible implementation in C# can be found here. Note that a permuterm index needs a lot of additional memory.

3 Comments

Why do you think query won't be a single letter? To me this looks like server side autocomplete functionality, and dictionary is not the appropriate data structure for this.
OP says: "I would like to have the ability to do ... .Contains on Description property...". But a trie allows only .StartsWith queries, doesn't it? So I recommended using a permuterm index for wildcard queries.
Anyway, +1 for mentioning permuterm indexes, which would handle both StartsWith and Contains cases well, although it requires a fairly large index tree to store all possible rotations. But permuterm is what answers OP's question, while a plain dictionary approach fails at anything but whole-word lookups, so you might want to rearrange your answer to emphasize that.

Fast string prefix searches are best done using a trie data structure. What's cool about the trie is that all descendants of any given node have a common prefix of the string associated with that node. It can also be compacted into a radix tree, which is perhaps slightly more complex to implement.

Right now you are using Linq-to-objects to iterate through the entire list for every search (and each StartsWith method is O(m) with m being the length of the query string). If you used Linq-to-SQL, it would get translated to an SQL query which would use indexes to perform a more efficient lookup.

This link has an example implementation of the auto-complete functionality using a trie.
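A minimal (uncompacted) trie sketch along those lines; all names here are illustrative and not taken from the linked article:

```csharp
using System;
using System.Collections.Generic;

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public string Title; // non-null when a full title ends at this node
}

public class TitleTrie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string title)
    {
        var node = root;
        foreach (var c in title)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.Title = title;
    }

    // Walk down to the prefix node in O(m), then collect up to `limit`
    // titles from the subtree below it; no scan over the full list.
    public List<string> StartsWith(string prefix, int limit)
    {
        var node = root;
        foreach (var c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return new List<string>(); // no title has this prefix
        var results = new List<string>();
        Collect(node, results, limit);
        return results;
    }

    private static void Collect(TrieNode node, List<string> results, int limit)
    {
        if (results.Count >= limit) return;
        if (node.Title != null) results.Add(node.Title);
        foreach (var child in node.Children.Values)
            Collect(child, results, limit);
    }
}
```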

(Update)

As @David mentioned in the comments, a trie might be overkill if you already have this data loaded in a list (that is, if you need to keep it in this form anyway, which you probably do). In that case, a better alternative for StartsWith queries would be to keep the list sorted. That would allow you to use binary search to get your results in O(m log n).
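A sketch of that sorted-list approach (PrefixSearch is my name for it): binary search locates the first candidate in O(m log n), then a short linear walk collects the matches. The list must be sorted with the same comparer used for the search (ordinal here), or the binary search contract breaks:

```csharp
using System;
using System.Collections.Generic;

static List<string> PrefixSearch(List<string> sorted, string prefix, int limit)
{
    int i = sorted.BinarySearch(prefix, StringComparer.Ordinal);
    if (i < 0) i = ~i; // bitwise complement gives the index of the first element >= prefix
    var results = new List<string>();
    while (i < sorted.Count && results.Count < limit &&
           sorted[i].StartsWith(prefix, StringComparison.Ordinal))
    {
        results.Add(sorted[i]);
        i++;
    }
    return results;
}
```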

Depending on whether the data is changed often, you might also use a balanced binary tree for storing the data, to allow you fast insertion/deletion (which is essentially what SortedDictionary gives you).

But ultimately, if you also need Contains queries, then you will either need to spare much more memory on indexing (as described by @Igor in his answer), or simply let your DBMS do it (as suggested by @David).

On the other hand, you might want to try using SQL Server's full text search. Or, if you are willing to go a bit outside of writing SQL, Lucene.Net with in-memory caching should give you even better performance.

8 Comments

Here is another autocomplete example with a couple of nice pictures and the source code.
The issue with this answer is that the trie approach is really only going to add overhead, i.e. more objects in memory. It was stated that there are 1.2+ million records; as such, pulling all of those into memory plus the extra data structures of the trie will only worsen performance. upcommons.upc.edu/bitstream/handle/2099.1/16165/…
@David: of course it's going to add memory overhead, that's what you are supposed to use your server's memory for. But worsen performance? How exactly? OP's current algorithm has a runtime complexity of O(n*m), where n is the number of records (1.2+ million of them), and m is the length of the query string. He needs to iterate through 1.2 million items for each query string. Now compare this to a trie, which needs at most m comparisons to get to the correct node. Not only will using a trie not worsen performance, it will improve it by several orders of magnitude.
My point is being misconstrued. I challenge you to create a SQL DB with 1.2 million records where there is a string field of varied length. Try performance testing both ways, i.e. let SQL do the lookup vs. the trie data structure. I know which is more performant but would love to be enlightened.
I am not sure what your point is, perhaps I misunderstood, but please clarify it. RDBMS are definitely smart, and with the right indexes can perform incredibly well. They also utilize memory efficiently. But will an in-memory trie return a prefix search faster than a %like% query run against a vanilla SQL Server instance? If implemented reasonably, yes. Just like a .NET in-memory dictionary will yield faster lookups than SQL queries by an indexed column. Just like a simple binary search on a sorted List<T> will run faster than a sargable SQL query.
