Efficient Parallel algorithm for array filtering

Question

Given a very large array I want to select only the elements that match some condition. I know a priori the number of elements that will be matched. My current pseucode is:

filter(list):
  out = list of predetermined size
  i = 0
  for element in list
    if element matches condition
      out[i++] = element
  return out

When trying to parallelize the previous algorithm, my naive approach was to just make the increment of i atomic (It's part of an OpenMP project so used #pragma omp atomic). However this implementation has slowed performance when compared to even the serial implementation. What more efficient algorithms are there to implement this? And why am I getting such heavy slowdown?

Information from the comments:

I'm using C with OpenMP;
Currently testing with one million entries;
Serial takes ~7 seconds, Parallel with two threads takes ~15;
The output array size it's exactly half (the problem is equivalent to finding the elements of the array smaller than the median of said array);
I'm testing it with two cores only.

dreamcrash · Accepted Answer · 2021-04-07 08:45:52Z

However this implementation has slowed performance when compared to even the serial implementation. What more efficient algorithms are there to implement this?

The bottleneck is the overhead of atomic operation, therefore at first glance a more efficient algorithm would be one that would avoid the use of such operation. Although that is possible, the code would require a two step approach for instance each thread would save the elements that it has find out in a private array. After the parallel region the main thread would collect all the elements that each thread have found and merge them into a single array.

The second part of merging into a single array can be made parallel as well or using SIMD instructions.

I have created a small code to mimic your pseudocode:

#include <time.h>
#include <omp.h>

int main ()
{
  int array_size = 1000000;
  int *a = malloc(sizeof(int) * array_size); 
  int *out = malloc(sizeof(int) * array_size/2);
 
  for(int i = 0; i < array_size; i++)
      a[i] = i;   
 
  double start = omp_get_wtime();
  int i = 0;
  #pragma omp parallel for
  for (int n=0 ; n < array_size; ++n ){
      if(a[n] % 2 == 0){
         int tmp;
         #pragma omp atomic capture
         tmp = i++;
         out[tmp] = a[n]; 
      }
  }
  double end = omp_get_wtime();
  printf("%f\n",end-start);
  free(a);
  free(out);
  return 0;
}

I did an ad hoc benchmark with the following results:

-> Sequential : 0.001597 (s)
-> 2 Threads  : 0.017891 (s)
-> 4 Threads  : 0.015254 (s)

So the parallel version is much much slower, which is expectable because the work performed in parallel is simply not enough to overcome the overhead of the atomic and of the parallelism.

I have tested also removing the atomic and leaving the race-condition just to check the time:

-> Sequential : 0.001597 (s)
-> 2 Threads  : 0.001283 (s)
-> 4 Threads  : 0.000720 (s)

So without the atomic the speedup was approximately 1.2 and 2.2 for 2 and 4 threads, respectively. So naturally the atomic is causing a huge overhead. Nonetheless, the speedups are not great even without any atomic. And this is the most that you can expect from parallelizing your code alone.

Depending in your real code, and on how computation demanding is your condition, you may not achieve great speedups even with a second approach that I have mentioned.

A useful note from the comments from @Paul G:

Even if it isn't going to speed up this particular example, it might be useful to state that problems like this are generally solved in parallel computing using a (parallel) exclusive scan algorithm/prefix sum to determine which thread starts at which index of the output array. Then every thread has a private output index it is incrementing and one doesn't need atomics.

That approach might be slower than @dreamcrashs solution in this particular case but is more general as having private arrays for each thread can be very tricky (determining their size etc.) especially if your output is bigger than the input (case where each input element doesnt give 0 or 1 output element, but n output elements).

Collectives™ on Stack Overflow

Efficient Parallel algorithm for array filtering

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related