
I need to make different GET queries to a server to download a bunch of JSON files and write each download to disk, and I want to launch some threads to speed that up.

Downloading and writing each file takes approximately 0.35 seconds.

I would like to know whether, under Linux at least (and under Windows while we're at it), it is safe to write to disk in parallel, and how many threads I can launch taking into account the waiting time of each thread.

In case it makes a difference (I suspect it does): the program doesn't write to disk directly. It just calls std::system to run wget, because that is currently easier than pulling in a library. So the waiting time is the time the system call takes to return.

So each write to disk is performed by a different process. I only wait for that program to finish, so I'm not really bound by I/O but by the running time of an external process (each wget call creates and writes to a different file, so they are completely independent processes). Each thread just waits for one call to complete. A rough sketch of this setup is shown below.
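For reference, a minimal sketch of what I'm describing (the URLs and output file names here are placeholders, not my real ones):

```cpp
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Placeholder URLs; the real program builds these from the GET queries.
    std::vector<std::string> urls = {
        "http://example.com/a.json",
        "http://example.com/b.json",
    };

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < urls.size(); ++i) {
        workers.emplace_back([i, &urls] {
            // Each thread blocks until its own wget process exits;
            // every call writes to a different file.
            std::string cmd = "wget -q -O out" + std::to_string(i) +
                              ".json \"" + urls[i] + "\"";
            std::system(cmd.c_str());
        });
    }
    for (auto& t : workers) t.join();
}
```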

My machine has 4 CPUs.

Some kind of formula to get an ideal number of threads according to CPU concurrency and "waiting time" per thread would be welcome.

NOTE: The ideal solution would of course be to do some performance testing, but I could get banned from the server if I abuse it with too many requests.

  • Be aware that there is no guarantee that the OS will schedule your threads on different cores or CPUs. Worst case, the OS will cycle your threads on one CPU (core). Commented Oct 6, 2017 at 23:08
  • Most disk I/O is serial, one transaction at a time. More than one thread accessing the disk will be queued up so that they wait for another thread before they can use the disk. You may find that one thread accessing the disk is faster than multiple threads accessing the disk. Commented Oct 6, 2017 at 23:10
  • One rule of performance for disk writing is to keep writing (non stop), or write as much data in one transaction as possible. In mechanical disk drives, the objective is to keep the platters spinning so there is no delay in starting or stopping the drive. Commented Oct 6, 2017 at 23:13
  • @ThomasMatthews I know that my program might perform as if it were single-threaded in the worst case, but let's assume that is not the case (there's nothing I can do to avoid it). About writing to disk, take into account that I'm calling an external program (wget), which does the writing, so each write to disk is performed by a different process, and I don't know whether that changes anything. Commented Oct 6, 2017 at 23:18
  • Try experimenting at the command-line... stackoverflow.com/a/29222049/2836621 and stackoverflow.com/a/42249033/2836621 Commented Oct 7, 2017 at 6:49

1 Answer


It is safe to do concurrent file I/O from multiple threads, but if you are concurrently writing to the same file, some form of synchronization is necessary to ensure that the writes to the file don't become interleaved.

For what you describe as your problem, it is perfectly safe to fetch each JSON blob in a separate thread and write them to different, unique files (in fact, this is probably the sanest, simplest design). Given that you mention running on a 4-core machine, I would expect to see a speed-up well past the four concurrent thread mark; network and file I/O tends to do quite a bit of blocking, so you'll probably run into a bottleneck with network I/O (or on the server's ability to send) before you hit a processing bottleneck.

Write your code so that you can control the number of threads that are spawned, and benchmark different numbers of threads. I'll guess that your sweet spot will be somewhere between 8 and 16 threads.
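A sketch of what that could look like, reusing the std::system/wget approach from the question with a fixed-size pool whose thread count is a single parameter you can benchmark (the sizing comment uses the common "cores × (1 + wait time / compute time)" rule of thumb as a rough starting point, not a measured result):

```cpp
#include <atomic>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

// Download urls[i] to out<i>.json using a fixed-size pool of worker threads.
void download_all(const std::vector<std::string>& urls, unsigned num_threads) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&] {
            // Each worker repeatedly claims the next unprocessed index.
            for (std::size_t i = next++; i < urls.size(); i = next++) {
                std::string cmd = "wget -q -O out" + std::to_string(i) +
                                  ".json \"" + urls[i] + "\"";
                std::system(cmd.c_str());
            }
        });
    }
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<std::string> urls = {/* ... */};

    // Rough heuristic: threads ≈ cores * (1 + wait_time / compute_time).
    // With 4 cores and work that is almost entirely waiting, that lands well
    // above 4, which is why benchmarking 4, 8, 16, ... is worthwhile.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;        // hardware_concurrency() may return 0
    unsigned num_threads = cores * 2; // starting point; tune by measurement
    download_all(urls, num_threads);
}
```

Changing num_threads (or taking it from the command line) then lets you compare runs without restructuring the program.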


1 Comment

Exactly, in my case it was 15 threads, although the virtual machine I'm using has 2 CPUs, sorry for the mistake. It's my underlying machine that has 4 CPUs.
