How can I fast count how often a string appears in a given part of a sting list?

Question

I have a list of strings where I need to count the number of list entries that have an occurances of a specific string inside of them (and the whole thing only for a subset of the list not the whole list).

The code below works quite well BUT its performance is.....sadly not in an acceptable niveau as I need to parse through 500k to 900k list entries.For these entries I need to run the code below about 10k times (as I have 10k parts of the list I need to analyse). For that it takes 177 seconds and even more. So my question is how can I do this...fast?

private int ExtraktNumbers(List<string> myList, int start, int end)
{
    return myList.Where((x, index) => index >= start && index <= end 
                        && x.Contains("MYNUMBER:")).Count();
}

I have tested your code ony my pc with a list that contains 1000000 strings and it needed 555 milliseconds for 10 runs. The range was from 50000 to 100000. So either your strings are huge or your bottleneck is somewhere else. — Tim Schmelter
– Tim Schmelter, Commented Sep 25, 2015 at 9:54
Instead of using Linq extension methods, have you tried to see how a simple iterative solution performs? for(int i = start; i <= end; i++)... — Dmytro Shevchenko
– Dmytro Shevchenko, Commented Sep 25, 2015 at 9:54
@DmytroShevchenko I didn't try the iterative part so far BUT I take it it would probably do the same as: MyList.GetRange(start, end).Where(x =Y x.contains("MyNumber:")),Count(); at least in essence? (just tried that one after you mentioned the iterative and that has a HUGE performance increase). — Thomas
– Thomas, Commented Sep 25, 2015 at 9:58
@Thomas, actually, since you've mentioned that you're calling this method about 10k times... that would be the best thing to optimize. Can you first collect all the substrings you want to count in this list, and then go through the list once, while maintaining a counter for each substring's number of matches? — Dmytro Shevchenko
– Dmytro Shevchenko, Commented Sep 25, 2015 at 10:18
@Thomas: you are calling the method 10000 times whereas our(?) approach just needs one call for all ranges. You need to check all occurences once, then you have all indices. Now you have still to check 1000 times but not against the whole list which contains 900k strings but only against the List<int> which contains less numbers. You also don't need to search the substring anymore but only to compare the index. The range check could look like: indicesWithMyNumber.Count(i => i >= start && i <= end) — Tim Schmelter
– Tim Schmelter, Commented Sep 25, 2015 at 10:40

musefan · Accepted Answer · 2015-09-25 10:38:28Z

3

Well now we know you are calling the method 10,00 times here is my suggestion. I assume as you have hardcoded "Number:" that it means you are doing different ranges with each call? So if that's the case...

First, run an 'indexing' method and create a list of which indices are a match. Then you can easily count up the matches for the ranges you need.

NOTE: This is something quick, and you may even be able to further optimize this too:

List<int> matchIndex = new List<int>();

void RunIndex(List<string> myList)
{
    for(int i = 0; i < myList.Count; i++)
    {
        if(myList[i].Contains("MYNUMBER:"))
        {
            matchIndex.Add(i);
        }
    }
}

int CountForRange(int start, int end)
{
    return matchIndex.Count(x => x >= start && x <= end);
}

Then you can use like this, for example:

RunIndex(myList);

// I don't know what code you have here, this is just basic example.
for(int i = 0; i <= 10,000; i++)
{
    int count = CountForRange(startOfRange, endOfRange);
    // Do something with count.
}

In addition, if you have a lot of duplication in the ranges you check then you could consider caching range counts in a dictionary, but at this stage it's hard to tell if that will be worth doing anyway.

edited Sep 25, 2015 at 10:38

answered Sep 25, 2015 at 10:28

musefan

48.5k21 gold badges144 silver badges193 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Thomas Over a year ago

Are you sure that would work as intended? because the startofrange and endofrange is determined by the original list but matchindex has a different number of entries than them

musefan Over a year ago

yeah, it should do, because matchIndex knows what the indices are. For example: If it contains matches [1, 2, 4, 7, 9] and you want range from 4 - 9, you will get result of 3 which would be correct. Note that the matchIndex.Count() is not using the index of the matchIndex list, but the values stored from the original indices. Give it a try and let me know, I am very interested to know if it works for you

Dmytro Shevchenko · Accepted Answer · 2015-09-25 10:01:43Z

2

I am pretty sure a simple iterative solution will perform better:

private int ExtractNumbers(List<string> myList, int start, int end)
{
    int count = 0;

    for (int i = start; i <= end; i++)
    {
        if (myList[i].Contains("MYNUMBER:"))
        {
            count++;
        }
    }

    return count;
}

answered Sep 25, 2015 at 10:01

Dmytro Shevchenko

34.9k6 gold badges56 silver badges68 bronze badges

4 Comments

Thomas Over a year ago

Its always the simplest solutions one does not think about :). Currently finishing a try to measure times. If I'm not mistaken this variant should perform even better than an .GetRange and .Where combination?

Dmytro Shevchenko Over a year ago

Well, GetRange() performs a shallow copy of the elements in your range. The strings themselves shouldn't be copied, but still, it adds unneeded overhead.

Tim Schmelter Over a year ago

GetRange is unnecessary, you don't want a copy of all sub-lists, that will cause an OutOfMemoryException sometime.

Dmytro Shevchenko Over a year ago

@TimSchmelter There are no sublists though, this is just a string list. The only things that will be copied over to the new collection are the references to the strings stored in the original list. This shouldn't cause OutOfMemoryException, but is still unnecessary.

Dmitrii Bychenko · Accepted Answer · 2015-09-25 10:14:33Z

1

Well for my test stand for 10 millions (10 times more than you have) lines

  var data = Enumerable
   .Range(1, 10000000)
   .Select(item => "123456789 bla-bla-bla " + "MYNUMBER:" + item.ToString())
   .ToList();

  Stopwatch sw = new Stopwatch();

  sw.Start();

  int result = ExtraktNumbers(data, 0, 10000000);

  sw.Stop();

I've got these results:

2.78 seconds - your initial implementtation

Naive loop (2.60 seconds):

private int ExtraktNumbers(List<string> myList, int start, int end) {
  int result = 0;

  for (int i = start; i < end; ++i)
    if (myList[i].Contains("MYNUMBER:"))
      result += 1;

  return result;
}

PLinq (1.72 seconds):

   private int ExtraktNumbers(List<string> myList, int start, int end) {
      return myList
        .AsParallel() // <- Do it in parallel
        .Skip(start - 1)
        .Take(end - start)
        .Where(x => x.Contains("MYNUMBER:"))
        .Count();
    }

Explicit parallel implementation (1.66 seconds):

   private int ExtraktNumbers(List<string> myList, int start, int end) {
     long result = 0;

     Parallel.For(start, end, (i) => {
       if (myList[i].Contains("MYNUMBER:"))
         Interlocked.Increment(ref result);
     });

     return (int) result;
  }

I just cannot reproduce your 177 seconds

edited Sep 25, 2015 at 10:14

answered Sep 25, 2015 at 10:09

Dmitrii Bychenko

188k20 gold badges178 silver badges231 bronze badges

1 Comment

Thomas Over a year ago

Like I mentioned in the comments I GUESS its because I don't just call the method once, but about 10k times that it increases that much (not sure why but I had managed to narrow the time increase down to the method itself).

async · Accepted Answer · 2015-09-25 10:42:10Z

0

If you know from the beginning the intervals you want to consider, it's probably a good idea to loop the list once, as Dmytro and musefan proposed above, so I won't repeat the same idea again.

However I have a different suggestion for performance improvement. How do you create your list? Do you know the number of items in advance? Because for such a big list, you may gest a significant performance boost by using the List<T> constructor that takes the initial capacity.

answered Sep 25, 2015 at 10:42

async

1,53711 silver badges29 bronze badges

Collectives™ on Stack Overflow

How can I fast count how often a string appears in a given part of a sting list?

4 Answers 4

2 Comments

4 Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

4 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related