1

I was googling for some advice about this and found some links. The most obvious was this one, but in the end what I'm wondering is how well my code is implemented.

I basically have two classes: Converter and ConverterThread.

I create an instance of the Converter class, which has a ThreadNumber property that tells me how many threads should run at the same time. This value is read from the user, since the application will be used on multi-CPU systems (physically, e.g. 8 CPUs), so running several threads is supposed to speed up the import.

The Converter instance reads a file that can range from 100 MB to 800 MB. Each line of the file is a tab-delimited record that is imported to another destination, such as a database.

The ConverterThread class simply runs inside the thread (new Thread(ConverterThread.StartThread)) and has event notification, so when its work is done it can notify the Converter class. I can then sum up the progress of all the threads and notify the user (in the GUI, for example) about how many records have been imported and how many bytes have been read.

However, I seem to be having some trouble: I get random errors saying the file cannot be read, or the summed progress (percentage) goes above 100%, which should not be possible. I think this happens because the threads are not being managed well, and the information returned by the event is probably malformed (since it "travels" from one thread to another).

Do you have any advice on better practices for implementing threads so I can accomplish this?

Thanks in advance.

3
  • Definitely agree with the sentiments of the other posters when they say that the complexity/difficulty of using multiple threads is probably going to outweigh any speed benefit... Commented Sep 16, 2009 at 2:03
  • Adding threads can very well improve read performance. I benchmarked this. See stackoverflow.com/questions/1033065/…. Commented Sep 16, 2009 at 7:27
  • OK, so I finally ended up using one single thread to read the big file and creating as many files as the number of threads the user configured, so if the user set 4 threads I divide the big file into 4 different files. As soon as that thread finishes, I create 4 threads, and each one reads a different file and processes each record. I haven't benchmarked this, but I will and let you know. Thanks all for the responses. Commented Sep 18, 2009 at 16:28

3 Answers

10

I read very large files in some of my own code and, I have to tell you, I am skeptical of any claim that adding threads to a read operation would actually improve the overall read performance. In fact, adding threads might actually reduce performance by causing head seeks. It is highly likely that any file operations of this type would be I/O bound, not CPU bound.

Given that the author of the post you referenced never actually provided the 'real' code, his claims that multiple threads will speed up I/O remain untestable by others. Any attempt to improve hard disk read/write performance by adding threads would most certainly be I/O bound, unless he is doing some serious number crunching between reads, or has stumbled upon some happy coincidence having to do with the disk cache, in which case the performance improvement might be irreproducible on another machine with different hardware characteristics.

Generally, when files of this size are involved, an additional 20% or 30% improvement in performance is not going to matter much, even if it is possible utilizing threads, because such a task would most certainly be considered a background task (not real-time). I use multiple threads for this kind of work, not because it improves read performance on one file, but because multiple files can be processed simultaneously in the background.

Before using threads to do this, I carefully benchmarked the software to see if threads would actually improve overall throughput. The results of the tests (on my development machine) were that using the same number of threads as the number of processor cores produced the maximum possible throughput. But that was processing ONE file per thread.
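To make the "one file per thread" arrangement concrete, here is a minimal sketch in Python (used only as a neutral illustration; the question itself is about .NET). The names `count_records` and `process_files` are hypothetical, and the per-file job is a stand-in that just counts non-empty tab-delimited lines. The pool size defaults to the number of cores, matching the benchmark result described above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_records(path):
    """Hypothetical per-file job: count the non-empty records in one file."""
    records = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records += 1
    return records


def process_files(paths, workers=None):
    """Process each file on its own worker and return {path: record_count}."""
    # Assumption: one worker per core unless the caller says otherwise.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each file is opened and read by exactly one worker, start to finish.
        counts = list(pool.map(count_records, paths))
    return dict(zip(paths, counts))
```

No two workers ever touch the same file, so there is no shared file handle to coordinate. Note that in CPython, threads mainly help when the per-file work waits on I/O (such as database calls); for CPU-heavy parsing, a process pool would be the closer equivalent.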


10

Multiple threads reading a file at the same time is asking for trouble. I would set up a producer-consumer model in which the producer reads the lines in the file, perhaps into a buffer, and then hands them out to the consumer threads as they finish processing their current workload. It does mean you have a blocking point where the lines are handed out, but if processing takes much longer than reading, then it shouldn't be that big of a deal. If reading is the slow part, then you really don't need multiple consumers anyway.
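A rough sketch of that producer-consumer arrangement, written in Python purely for illustration (the question is about .NET, where a blocking queue would play the same role). The function name `import_lines`, the `handle_record` callback, and the default sizes are all assumptions made for this example, not anything from the original code.

```python
import queue
import threading

_DONE = object()  # sentinel object that tells a consumer to stop


def import_lines(lines, handle_record, num_consumers=4, buffer_size=1000):
    """One producer feeds lines to consumers; returns (processed, errors)."""
    work = queue.Queue(maxsize=buffer_size)  # bounded: put() blocks when full
    lock = threading.Lock()
    processed = 0
    errors = []

    def consumer():
        nonlocal processed
        while True:
            line = work.get()
            if line is _DONE:
                break
            try:
                handle_record(line.rstrip("\n").split("\t"))
            except Exception as exc:  # keep draining so the producer never stalls
                with lock:
                    errors.append(exc)
                continue
            with lock:
                processed += 1

    threads = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for thread in threads:
        thread.start()
    for line in lines:  # the only place where the input is read
        work.put(line)
    for _ in threads:  # one sentinel per consumer
        work.put(_DONE)
    for thread in threads:
        thread.join()
    return processed, errors
```

Passing an open file object as `lines` would make the calling thread the single reader. The bounded queue is the blocking point mentioned above: if the consumers fall behind, the reader simply waits instead of loading the whole file into memory.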

2 Comments

Very well said, particularly the last part.
Actually, processing the data is what takes the most time. Indeed, what I'm doing right now is that the main thread reads the file line by line, and as each line is consumed a new thread is created and handed that line so it can process the information. As soon as a thread is done, I fire an event that tells me the thread has finished so I can create a new one, so I don't create more threads than the user specified (the number of threads is configurable).
0

You should try to just have one thread read the file, since multiple threads will likely be bound by the I/O anyway. Then you can feed the lines into a thread-safe queue from which multiple threads can dequeue lines to parse.

You won't be able to tell the progress of any one thread because that thread has no defined amount of work. However, you should be able to track approximate progress by keeping track of how many items (total) have been added to the queue and how many have been taken out. Obviously, as your file reader thread puts more lines into the queue, your progress will appear to decrease because more lines are available, but presumably you should be able to fill the queue faster than the workers can process the lines.
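As a minimal sketch of that bookkeeping (in Python for illustration only; the class name `QueueProgress` and its methods are made up for this example), the two counters can live behind a single lock so that updates from different threads are never lost:

```python
import threading


class QueueProgress:
    """Thread-safe counts of items enqueued by the reader and finished by workers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enqueued = 0
        self._completed = 0

    def item_enqueued(self):
        """Called by the reader thread each time it adds a line to the queue."""
        with self._lock:
            self._enqueued += 1

    def item_completed(self):
        """Called by a worker after it finishes one dequeued line."""
        with self._lock:
            self._completed += 1

    def percent(self):
        """Approximate progress: completed items over items seen so far."""
        with self._lock:
            if self._enqueued == 0:
                return 0.0
            return 100.0 * self._completed / self._enqueued
```

As long as `item_completed` is only called for lines that were previously enqueued, the percentage cannot go above 100, and it dips whenever the reader gets ahead of the workers, as described above. If the total number of lines (or bytes) is known up front, dividing by that fixed total instead would give a figure that only moves forward.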

