3

I have a list of strings where I need to count the number of list entries that have an occurances of a specific string inside of them (and the whole thing only for a subset of the list not the whole list).

The code below works quite well BUT its performance is.....sadly not in an acceptable niveau as I need to parse through 500k to 900k list entries.For these entries I need to run the code below about 10k times (as I have 10k parts of the list I need to analyse). For that it takes 177 seconds and even more. So my question is how can I do this...fast?

private int ExtraktNumbers(List<string> myList, int start, int end)
{
    return myList.Where((x, index) => index >= start && index <= end 
                        && x.Contains("MYNUMBER:")).Count();
}
25
  • 1
    I have tested your code ony my pc with a list that contains 1000000 strings and it needed 555 milliseconds for 10 runs. The range was from 50000 to 100000. So either your strings are huge or your bottleneck is somewhere else. Commented Sep 25, 2015 at 9:54
  • 3
    Instead of using Linq extension methods, have you tried to see how a simple iterative solution performs? for(int i = start; i <= end; i++)... Commented Sep 25, 2015 at 9:54
  • 1
    @DmytroShevchenko I didn't try the iterative part so far BUT I take it it would probably do the same as: MyList.GetRange(start, end).Where(x =Y x.contains("MyNumber:")),Count(); at least in essence? (just tried that one after you mentioned the iterative and that has a HUGE performance increase). Commented Sep 25, 2015 at 9:58
  • 1
    @Thomas, actually, since you've mentioned that you're calling this method about 10k times... that would be the best thing to optimize. Can you first collect all the substrings you want to count in this list, and then go through the list once, while maintaining a counter for each substring's number of matches? Commented Sep 25, 2015 at 10:18
  • 1
    @Thomas: you are calling the method 10000 times whereas our(?) approach just needs one call for all ranges. You need to check all occurences once, then you have all indices. Now you have still to check 1000 times but not against the whole list which contains 900k strings but only against the List<int> which contains less numbers. You also don't need to search the substring anymore but only to compare the index. The range check could look like: indicesWithMyNumber.Count(i => i >= start && i <= end) Commented Sep 25, 2015 at 10:40

4 Answers 4

3

Well now we know you are calling the method 10,00 times here is my suggestion. I assume as you have hardcoded "Number:" that it means you are doing different ranges with each call? So if that's the case...

First, run an 'indexing' method and create a list of which indices are a match. Then you can easily count up the matches for the ranges you need.

NOTE: This is something quick, and you may even be able to further optimize this too:

List<int> matchIndex = new List<int>();

void RunIndex(List<string> myList)
{
    for(int i = 0; i < myList.Count; i++)
    {
        if(myList[i].Contains("MYNUMBER:"))
        {
            matchIndex.Add(i);
        }
    }
}

int CountForRange(int start, int end)
{
    return matchIndex.Count(x => x >= start && x <= end);
}

Then you can use like this, for example:

RunIndex(myList);

// I don't know what code you have here, this is just basic example.
for(int i = 0; i <= 10,000; i++)
{
    int count = CountForRange(startOfRange, endOfRange);
    // Do something with count.
}

In addition, if you have a lot of duplication in the ranges you check then you could consider caching range counts in a dictionary, but at this stage it's hard to tell if that will be worth doing anyway.

Sign up to request clarification or add additional context in comments.

2 Comments

Are you sure that would work as intended? because the startofrange and endofrange is determined by the original list but matchindex has a different number of entries than them
yeah, it should do, because matchIndex knows what the indices are. For example: If it contains matches [1, 2, 4, 7, 9] and you want range from 4 - 9, you will get result of 3 which would be correct. Note that the matchIndex.Count() is not using the index of the matchIndex list, but the values stored from the original indices. Give it a try and let me know, I am very interested to know if it works for you
2

I am pretty sure a simple iterative solution will perform better:

private int ExtractNumbers(List<string> myList, int start, int end)
{
    int count = 0;

    for (int i = start; i <= end; i++)
    {
        if (myList[i].Contains("MYNUMBER:"))
        {
            count++;
        }
    }

    return count;
}

4 Comments

Its always the simplest solutions one does not think about :). Currently finishing a try to measure times. If I'm not mistaken this variant should perform even better than an .GetRange and .Where combination?
Well, GetRange() performs a shallow copy of the elements in your range. The strings themselves shouldn't be copied, but still, it adds unneeded overhead.
GetRange is unnecessary, you don't want a copy of all sub-lists, that will cause an OutOfMemoryException sometime.
@TimSchmelter There are no sublists though, this is just a string list. The only things that will be copied over to the new collection are the references to the strings stored in the original list. This shouldn't cause OutOfMemoryException, but is still unnecessary.
1

Well for my test stand for 10 millions (10 times more than you have) lines

  var data = Enumerable
   .Range(1, 10000000)
   .Select(item => "123456789 bla-bla-bla " + "MYNUMBER:" + item.ToString())
   .ToList();

  Stopwatch sw = new Stopwatch();

  sw.Start();

  int result = ExtraktNumbers(data, 0, 10000000);

  sw.Stop();

I've got these results:

2.78 seconds - your initial implementtation

Naive loop (2.60 seconds):

private int ExtraktNumbers(List<string> myList, int start, int end) {
  int result = 0;

  for (int i = start; i < end; ++i)
    if (myList[i].Contains("MYNUMBER:"))
      result += 1;

  return result;
}

PLinq (1.72 seconds):

   private int ExtraktNumbers(List<string> myList, int start, int end) {
      return myList
        .AsParallel() // <- Do it in parallel
        .Skip(start - 1)
        .Take(end - start)
        .Where(x => x.Contains("MYNUMBER:"))
        .Count();
    }

Explicit parallel implementation (1.66 seconds):

   private int ExtraktNumbers(List<string> myList, int start, int end) {
     long result = 0;

     Parallel.For(start, end, (i) => {
       if (myList[i].Contains("MYNUMBER:"))
         Interlocked.Increment(ref result);
     });

     return (int) result;
  }

I just cannot reproduce your 177 seconds

1 Comment

Like I mentioned in the comments I GUESS its because I don't just call the method once, but about 10k times that it increases that much (not sure why but I had managed to narrow the time increase down to the method itself).
0

If you know from the beginning the intervals you want to consider, it's probably a good idea to loop the list once, as Dmytro and musefan proposed above, so I won't repeat the same idea again.

However I have a different suggestion for performance improvement. How do you create your list? Do you know the number of items in advance? Because for such a big list, you may gest a significant performance boost by using the List<T> constructor that takes the initial capacity.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.