
I am new to OpenMP and I want to divide for-loop iterations into equal chunks. I have achieved this so far:

#pragma omp parallel for schedule(static, 2) reduction(+:tot_ext) 
for (int i = 0; i < num_pos; i++) {

    if (fscanf(stdin,"%lu,%lu\n", &from, &to) != 2) {
        fprintf(stderr, "Cannot read correctly intervals file\n");
        exit(1);
    }

    time = getTime();
    text = (uchar*)malloc(to-from+2);

    readlen = sdsl::extract(csa, from, to, text);
    tot_time += (getTime() - time);

    tot_ext += readlen;

    if (Verbose) {
        fwrite(&from,sizeof(ulong),1,stdout);
        fwrite(&readlen,sizeof(ulong),1,stdout);
        fwrite(text,sizeof(uchar),readlen, stdout);
    }

    free(text);
}
  • Time it takes to run this query on one core: 2.72 sec.

  • Time it takes to run this query on two cores: 2.64 sec.

My question is: why is the difference so small?

  • What compiler flags did you use? Was Verbose true? How large is num_pos? Those are the questions one would generally ask, but in your case there are several issues. You read from a file using fscanf (even though you're using C++), so it could be that you spend most of your time reading from disk. But perhaps most importantly, how did you measure this time? If you used tot_time, then note that BOTH threads increment that value (not safely, but still), so the time stored there should be roughly double the actual time. Commented Oct 16, 2018 at 11:46
  • Thank you for your answer. I didn't specify any compiler flags. num_pos is currently set to 10000. The code is my slight modification of the sdsl-lite library, which contains benchmarks for counting, locating and extracting strings from different text files. Commented Oct 16, 2018 at 12:00
  • You really should use -O3 if speed is what you're after. We still need to know how you measure time though. Commented Oct 16, 2018 at 14:07
  • File operations are probably serialized (for good reason) in your run-time library. In order to get parallelism, you need a parallel file system and function library. It would not be portable. In the simplest possibility, you would need a file handle for each disk and each thread. Commented Oct 18, 2018 at 12:52

4 Answers


OpenMP sets up a thread pool, and at the point where the split happens, work is divided between (some of) these threads, according to the circumstances.

There is a significant cost involved in apportioning the work, starting the worker threads, and joining them all at the end.

Unless there is a really significant amount of work to be done in each loop iteration, the overhead of managing the threads can vastly outweigh the benefits. It can be many times slower to parallelise a small loop than to just run it through on a single processor, which can take advantage of register reuse and the local core cache.

If you call certain system functions, such as fwrite, they will likely force synchronisation between threads as well, which will affect results.

For a fairer test, try having blocks of work which:

  • are self-contained
  • don't use synchronised calls
  • are substantial pieces of work in their own right

This will allow you to test the possible advantages of parallelism in a more applicable way.




First remark

Quote from the OpenMP forum (http://forum.openmp.org/forum/viewtopic.php?f=3&t=764 )

In general, reading and writing a file from multiple threads is not a good idea unless the underlying operating system or I/O system really supports it (commonly referred to as parallel I/O).
A lot of the time the I/O routines can be used to read/write separate files at the same time, because each I/O instance keeps its own data area telling which line it is reading or writing. However, when you have multiple threads trying to read/write the same file, you need the I/O system to keep one data area for that file for all threads. This is not what is generally done, so you end up serializing (or somehow locking) the I/O. This, of course, slows down the process and thus makes parallelizing I/O relatively impractical (since you will not see a speedup).

This applies also to standard output

So in the best case the fscanf and fwrite parts will run sequentially; in the worst case it will be a mess.

Second Remark

Since

  • OpenMP threads share a single executable, global memory, and heap (malloc, new),

malloc must be protected, e.g. by an omp critical section (I do not know whether the compiler does it automatically).

So the malloc part is sequential.

Last Remark:

tot_ext is reduced so you can't expect full scalability here.

Conclusion

In the end, the only part really executed in parallel is readlen = sdsl::extract(csa, from, to, text); if that is not the most expensive operation, your timing is consistent.

1 Comment

malloc is typically thread-safe; glibc does some tricks to avoid contention. Of course, efficient memory allocation is a very complex topic, and having malloc in a tight loop should be avoided anyway.

Regarding only performance, not thread-safety: the problem with this test is that you are using things like stdin, getTime, malloc, fwrite, free, etc. Most of these functions are implemented internally with operating-system calls, causing unpredictable delays and synchronisation. Try to avoid such calls and you will see better results.



Problem solved. The main reason my measured times were bad was the getTime() method, which measures the time spent on every thread and sums it up. Using omp_get_wtime() I was able to measure the wall-clock time of the operation itself. The results are the following:

  • 1 thread: 2.72 sec
  • 2 threads: 1.41 sec
  • 3 threads: 0.94 sec
  • 4 threads: 0.73 sec

Thank you all for your answers. Based on your suggestions I managed to clean up my code a bit, so it is more parallel-efficient now :)

1 Comment

Good that you've managed to structure it in a sensible way, and that you can see the results of executing in parallel (and also, it appears, the diminishing returns as you add more threads.) You might consider editing your answer to show the code that produces these results - others might find it instructive.
