
I have a file with more than 500,000 URLs. I want to read the file and parse every URL with my function, which returns a string message. Everything works fine, but the performance is poor, so I need to run the parsing in simultaneous threads (for example, 100 threads).

ParserEngine parseEngine = new ParserEngine(parseFormulas);

using (StreamReader reader = new StreamReader("urls.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string result = parseEngine.Parse(line);
        Console.WriteLine(result);
    }
}

It would also be good if I could stop all the threads by clicking a button, and change the number of threads. Any help or tips?

  • Which version of .NET are you using? Commented Mar 28, 2011 at 19:38
  • Are you sure your performance hit is in the Parse method call, as opposed to the I/O itself? At the very least, do some measurements to see what benefit you'll gain by going multithreaded. Commented Mar 28, 2011 at 19:38
  • Version 4.0. The Parse method needs up to 60 seconds to finish: page downloading, HTML parsing using Html Agility Pack, regex matching, and many other operations besides the I/O. Commented Mar 28, 2011 at 19:47

4 Answers


Be sure to check out this article on PLINQ performance compared to other techniques for parsing a text file, line-by-line, using multi-threading.

Not only does it provide sample source code for doing something almost identical to what you want, but they also discovered a "gotcha" with PLINQ that can result in abnormally slow times. In a nutshell, if you try to use File.ReadAllLines() or StreamReader.ReadLine() you'll spoil the performance because PLINQ can't properly divide the file up that way. They solved the problem by reading all the lines into an indexed array, and THEN processing it with PLINQ.
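The approach the article describes can be sketched roughly as follows. This is a minimal sketch, not the article's actual code: the lines are first collected into an indexed array so PLINQ can partition by range, and `parseEngine.Parse` is the asker's method, stood in for here by a trivial transform.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class PlinqParseSketch
{
    static void Main()
    {
        // Read every line into an indexed array FIRST, rather than
        // feeding PLINQ a streaming ReadLine() enumerator.
        var lines = new List<string>();
        using (var reader = new StreamReader("urls.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        string[] indexed = lines.ToArray();

        // Process the array in parallel; parseEngine.Parse(line)
        // would go where the placeholder Trim() call is.
        var results = indexed
            .AsParallel()
            .AsOrdered()   // optional: keep output in input order
            .Select(line => line.Trim());

        foreach (string result in results)
            Console.WriteLine(result);
    }
}
```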



Honestly, for the performance difference I would just try Parallel.ForEach in .NET 4.0, if that is an option.

using System.Threading.Tasks;

Parallel.ForEach(enumerableList, p =>
{
    parseEngine.Parse(p);
});

It's a decent start for running things in parallel and should minimize your thread troubleshooting headaches.

Comments


A producer/consumer setup would be good for this: one thread reads from the file and writes to a queue, and the other threads read from the queue.

You mentioned an example of 100 threads. With that many threads you would want to read from the queue in batches, since you'd have to lock the queue around each read; a plain Queue<T> is not safe for concurrent readers and writers.

.NET 4.0 adds ConcurrentQueue<T> (in System.Collections.Concurrent), which handles the synchronization for you.

You really only want one reader on the file.
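A minimal sketch of that setup, using .NET 4.0's BlockingCollection<T> (which wraps a ConcurrentQueue<T> by default). The file name and consumer count are placeholders, and the asker's `parseEngine.Parse(line)` would replace the trivial Trim() call:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        // Bounded queue: the producer blocks when consumers fall behind,
        // so the whole file is never held in memory at once.
        var queue = new BlockingCollection<string>(boundedCapacity: 1000);

        // Single producer: the only thread that touches the file.
        var producer = Task.Factory.StartNew(() =>
        {
            foreach (string line in File.ReadLines("urls.txt"))
                queue.Add(line);
            queue.CompleteAdding();   // tell consumers no more items are coming
        });

        // A handful of consumers; parseEngine.Parse(line) goes here.
        var consumers = new Task[4];
        for (int i = 0; i < consumers.Length; i++)
        {
            consumers[i] = Task.Factory.StartNew(() =>
            {
                foreach (string line in queue.GetConsumingEnumerable())
                    Console.WriteLine(line.Trim());
            });
        }

        producer.Wait();
        Task.WaitAll(consumers);
    }
}
```

GetConsumingEnumerable() blocks until an item is available and ends cleanly once CompleteAdding() has been called and the queue drains, so no manual locking is needed.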



You could use Parallel.ForEach() to spread the work for the items in the list across all available processors, assuming that parseEngine takes some time to run. If parseEngine runs quickly (say, less than 250 ms per item), increase the number of "on-demand" threads by calling ThreadPool.SetMinThreads(), which will let more threads run at once.
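Parallel.ForEach also gives you the "stop on button click" behavior the question asks for, via a CancellationToken. A minimal sketch (the URL list and parallelism degree are placeholders; a button's Click handler would call cts.Cancel()):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class CancelableParseSketch
{
    static void Main()
    {
        var cts = new CancellationTokenSource();
        var options = new ParallelOptions
        {
            CancellationToken = cts.Token,
            MaxDegreeOfParallelism = 8   // tune this instead of hard-coding 100 threads
        };

        // Placeholder work list; in the real program this is the file's URLs.
        var urls = Enumerable.Range(0, 100).Select(i => "http://example.com/" + i);

        try
        {
            // A UI button's Click handler would call cts.Cancel() to stop this loop.
            Parallel.ForEach(urls, options, url =>
            {
                Console.WriteLine(url);   // parseEngine.Parse(url) would go here
            });
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Parsing stopped.");
        }
    }
}
```

When the token is cancelled, Parallel.ForEach stops starting new iterations and throws OperationCanceledException once the in-flight ones finish.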

