C# Fastest way to determine if a string contains all elements of a list

Question

Quick background. I have a string of words - I separate out those words into a List (I've tried HashSet it doesn't make any difference - and you lose the ordered nature of a List).

I then manipulate the original words in many dull ways - and create thousands of "new strings" - all of these strings are in a StringBuilder which has been set .ToString();

At the end of the manipulation, I want to QC those new strings - and be sure that every word that was in the original set - is still somewhere in those new strings and I have not accidentally lost a word.

That original string, can run to hundreds of individual words.

Short Example:

List<string> uniqueWords = new List<string> { "two", "three", "weather sunday" };

string final = "two and tomorrow\n\rtwo or wednesday\n\rtwo with thursday\n\rtwo without friday\n\rthree gone tomorrow\n\rthree weather saturday\n\rthree timely sunday";

The output string can run to tens of millions of characters, millions of words, 200,000+ rows of data (when split). You may notice that there are words that are actually two words separated by a space - so I cannot simply split out the individual words by splitting on the space as comparing them to the original would fail, and I need to confirm the words are exactly as they appeared originally - having weather somewhere and sunday somewhere - is not the same as having 'weather sunday' - for my purposes.

The the code I have tried so far and have benchmarked:

First attempt:

var allWords = uniqueWords.Where(substring => final.Contains(substring, StringComparison.CurrentCultureIgnoreCase)).ToList();

Second Attempt:

List<string> removeableList = new(uniqueWords);
foreach (var item in uniqueWords)
{
    if (removeableList.Count == 0)
    {
        break;
    }
    if (final.Contains(item))
    {
        removeableList.Remove(item);
    }
}

Third Attempt:

List<string> removeableList = new(uniqueWords);
for (int i = uniqueWords.Count; i >= 0; i--)
{
    if (removeableList.Count == 0)
    {
        break;
    }
    if (final.Contains(uniqueWords[i]))
    {
        removeableList.Remove(uniqueWords[i]);
    }
}

These are the results:

These results are repeatable, though I will say that the First Attempt tends to fluctuate quite a lot while the Second and Third Attempts tend to remain at about the same level - the Third Attempt does seem to do better than the Second rather consistently.

Are there any options that I am missing?

I have tried it using a Regex Matches collection into a HashSet - oh that was bad, 4 times worse than the First Attempt.

If there is a way to improve the performance on this task I would love to find it.

Why is it that on the first attempt you are ignoring case but on the second and third you are not? This may explain the performance difference. — Matt Bommicino
– Matt Bommicino, Commented Aug 24, 2022 at 23:52
Are you asking why the first method is slower than the others? Or are you asking if there is something that can do the job faster than your 3 attempts? — WBuck
– WBuck, Commented Aug 25, 2022 at 1:53
.Remove(uniqueWords[i]) will search the collection. You want .RemoveAt(i). foreach(var item in uniqueWords) will be compiled to for(var i=0; ....) for arrays, so doing that by hand doesn't save you anything. — Jeremy Lakeman
– Jeremy Lakeman, Commented Aug 25, 2022 at 7:06
Do you care about word breaks? eg "something".Contains("thing") — Jeremy Lakeman
– Jeremy Lakeman, Commented Aug 25, 2022 at 7:11

Charlieface · Accepted Answer · 2022-08-25 11:58:11Z

Your attempt #1 uses CurrentCultureIgnoreCase which will be slow. But even after removing that, you are adding to the list, rather than removing, and therefore the list might need to be resized.

You are also measuring two different things: option #1 is getting the list of words which are in final, the others get the list of words which are not.

Further options include:

Use List.RemoveAll

List<string> remainingWords = new(uniqueWords);
remainingWords.RemoveAll(final.Contains);   // use delegate directly, without anonymous delegate

Use a pre-sized list and use Linq

List<string> remainingWords = new(uniqueWords.Length);
remainingWords.AddRange(uniqueWords.Where(s => !final.Contains(s)));

Each of these two options can be flipped depending on what result you are trying to achieve, as mentioned.

List<string> words = new(uniqueWords);
words.RemoveAll(s => !final.Contains(s));

List<string> words = new(uniqueWords.Length);
words.AddRange(uniqueWords.Where(final.Contains));   // use delegate directly, without anonymous delegate

Kennetic · Accepted Answer · 2022-08-25 10:52:00Z

0

@Charlieface, thanks for that - I tried those, I think you have a point about adding to a list - as that appears much slower. For me it doesn't matter whether it is adding or removing, the result is a True/False return - whether the list is empty or of the size of the original list.

Sixth Attempt:

List<string> removeableList = new(uniqueWords.Count);
removeableList.AddRange(uniqueWords.Where(s => !parsedTermsComplete!.Contains(s)));

Seventh Attempt:

List<string> removeableList = new(uniqueWords);
removeableList.RemoveAll(parsedTermsComplete!.Contains);

Results in comparison to Third Attempt (fastest generally):

The adding does appear slower - and memory is a little higher for the RemoveAll but timing is consistent - bearing in mind it fluctuates depending on what Windows decides to do at any given moment...

Here is an interesting implementation of the AhoCorasickTree method - which I saw mentioned on this site somewhere else.

My knowledge on this is extremely limited so this may not be a good implementation at all - I am not saying it is a good implementation just that it works - this comes from a nuget package, but I am unsure on SO's policy on nuget package links, so won't link for now. In testing, creating an array was faster than creating a list.

Eighth Attempt:

var wordArray = uniqueWords.ToArray();
int i = uniqueWords.Count - 1;
foreach (var item in wordArray)
{
    var keyWords = new AhoCorasickTree(new[] { item });
    if (keyWords.Contains(parsedTermsComplete))
    {
        uniqueWords.RemoveAt(i);
    }
    i--;
}

I noticed in testing that creating a "removableList" was actually slower than creating a removableArray (found this out implementing the above Aho run). I updated the Third Attempt to incorporate this:

var removeableArray = uniqueWords.ToArray();
for (int i = removeableArray.Length -1; i >= 0; i--)
{
    if (!uniqueWords.Any())
    {
        break;
    }
    if (parsedTermsComplete!.Contains(removeableArray[i]))
    {
        uniqueWords.RemoveAt(i);
    }
}

The Benchmarks come out like this, the Third Attempt is updated to an array, the Seventh Attempt is the AhoCorasick implementation on a list, and the Eighth Attempt is the AhoCorasick implementation on an Array.

The ToArray - does seem faster than List, which is good to know.

My only issue with the AhoCorasick is that in practice - in a WASM application - this is actually much slower, so not a good option for me - but I put it here because it does seem to be much faster in Benchmarks (may be using multiple threads where WASM is limited to 1) and doesn't appear to allocate any memory, so might be useful to someone - interesting that the Third Attempt also appears to be allocated no memory when using an Array implementation whereas on a list it was allocated.

edited Aug 25, 2022 at 10:52

answered Aug 25, 2022 at 10:45

Kennetic

1231 silver badge10 bronze badges

6 Comments

Charlieface Over a year ago

If you want just a true/false then all you need is uniqueWords.Any(final.Contains) which is likely to be the fastest

Kennetic Over a year ago

As I mentioned only in the last post, this is a WASM application and that really seems to change things from the Benchmarks, and your final example is marginally faster than the First Attempt in the first post (1:03 vs 1:15) - so I think your final example here is the way to go for a WASM app. - everything else, all the foreach, for i, aho (which is multithreaded) - all are up at the 2-3 min mark.

Kennetic Over a year ago

Hi Charlie, is the problem with Any - that it will return True if just 1 element is in the string? When the test to be applied is that ALL elements in the List are in the string?

Charlieface Over a year ago

.All is what you are looking for

Kennetic Over a year ago

Yeah found that earlier...oddly, exact same speed in benchmarking as the loops. So back at square one.

|

Collectives™ on Stack Overflow

C# Fastest way to determine if a string contains all elements of a list

2 Answers 2

Comments

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related