However this implementation has slowed performance when compared to
even the serial implementation. What more efficient algorithms are
there to implement this?
The bottleneck is the overhead of atomic operation, therefore at first glance a more efficient algorithm would be one that would avoid the use of such operation. Although that is possible, the code would require a two step approach for instance each thread would save the elements that it has find out in a private array. After the parallel region the main thread would collect all the elements that each thread have found and merge them into a single array.
The second part of merging into a single array can be made parallel as well or using SIMD instructions.
I have created a small code to mimic your pseudocode:
#include <time.h>
#include <omp.h>
int main ()
{
int array_size = 1000000;
int *a = malloc(sizeof(int) * array_size);
int *out = malloc(sizeof(int) * array_size/2);
for(int i = 0; i < array_size; i++)
a[i] = i;
double start = omp_get_wtime();
int i = 0;
#pragma omp parallel for
for (int n=0 ; n < array_size; ++n ){
if(a[n] % 2 == 0){
int tmp;
#pragma omp atomic capture
tmp = i++;
out[tmp] = a[n];
}
}
double end = omp_get_wtime();
printf("%f\n",end-start);
free(a);
free(out);
return 0;
}
I did an ad hoc benchmark with the following results:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.017891 (s)
-> 4 Threads : 0.015254 (s)
So the parallel version is much much slower, which is expectable because the work performed in parallel is simply not enough to overcome the overhead of the atomic and of the parallelism.
I have tested also removing the atomic and leaving the race-condition just to check the time:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.001283 (s)
-> 4 Threads : 0.000720 (s)
So without the atomic the speedup was approximately 1.2 and 2.2 for 2 and 4 threads, respectively. So naturally the atomic is causing a huge overhead. Nonetheless, the speedups are not great even without any atomic. And this is the most that you can expect from parallelizing your code alone.
Depending in your real code, and on how computation demanding is your condition, you may not achieve great speedups even with a second approach that I have mentioned.
A useful note from the comments from @Paul G:
Even if it isn't going to speed up this particular example, it might
be useful to state that problems like this are generally solved in
parallel computing using a (parallel) exclusive scan algorithm/prefix
sum to determine which thread starts at which index of the output
array. Then every thread has a private output index it is incrementing
and one doesn't need atomics.
That approach might be slower than @dreamcrashs solution in this
particular case but is more general as having private arrays for each
thread can be very tricky (determining their size etc.) especially if
your output is bigger than the input (case where each input element
doesnt give 0 or 1 output element, but n output elements).