code that may not be directly parallelized by OpenMP

I am working on an autoparallelizer and looking for benchmarks that may not be directly parallelized by OpenMP ( "directly parallelized" meaning: creating a parallel executable without modifying the code, just by specifying proper OpenMP directives ).

While on the subject, is it possible to parallelize the following code using OpenMP?

1.

double a[ARRAY_DIM], c[ARRAY_DIM];
  .
double ret;
ret = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
  c[i] = exp( ret );
  ret += a[i];
}
return ret;

2.

double a[ARRAY_DIM], c[ARRAY_DIM];
   .
double ret;
ret = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
  if ( a[i] > 0.01 ) {
    ret = c[i];
  }
}
return ret;
  • Yes, by adding the proper #pragma omp, you don't have to change your code. You can optimize further by changing your code, but (usually) adding the #pragma already gives an improvement. Commented Jul 30, 2018 at 15:03
  • Neither of your code examples is easy to parallelize with OpenMP. That does not mean it can't be done, but it can't be done with one pragma statement. The first case is a cumulative sum/prefix sum. The second case requires knowing the last iteration that modifies ret, i.e. the result is dependent on i. Commented Jul 31, 2018 at 7:29
  • Actually, the second case can be done with a single pragma using the ordered clause. Commented Jul 31, 2018 at 7:38
  • Actually two pragmas. Commented Jul 31, 2018 at 8:10
  • Why did you reinsert . into your code? Also, why do you insist on having a return when you did not even define a function? Commented Aug 3, 2018 at 9:22

1 Answer

Code example 1 can be parallelized with OpenMP using a prefix sum solution.

For code example 2 I have two solutions in the code below, called foo1 and foo2. The function foo1 is easier to implement but is probably much less efficient if a[i] > 0.01 is frequently true: for example, if it is always true, ret has to be written every iteration in order, which kills the parallelization. The function foo2 only has to write in order once per thread rather than per iteration, and since the number of iterations is much greater than the number of threads, this is much more efficient.

#include <omp.h>
#include <stdio.h>
#define ARRAY_DIM 1000
double a[ARRAY_DIM], c[ARRAY_DIM];

/* foo1: run the loop in parallel and serialize only the assignment
   to ret with an ordered region, so the last iteration with
   a[i] > 0.01 still wins. */
double foo1() {
  double ret = 0;
  #pragma omp parallel for ordered schedule(static)
  for (int i = 0; i < ARRAY_DIM; i++) {
    if ( a[i] > 0.01 ) {
      #pragma omp ordered
      ret = c[i];
    }
  }
  return ret;
}

/* foo2: each thread records the last qualifying index of its chunk
   in k2, then the per-thread results are combined in thread order
   so the globally last qualifying index ends up in k. */
double foo2() {
  int k = -1;
  #pragma omp parallel
  {
    int k2 = -1;
    #pragma omp for schedule(static)
    for (int i = 0; i < ARRAY_DIM; i++) if ( a[i] > 0.01 ) k2 = i;
    /* one ordered iteration per thread, in thread order */
    #pragma omp for schedule(static) ordered
    for (int i = 0; i < omp_get_num_threads(); i++)
      #pragma omp ordered
      if (k2 >= 0) k = k2;
  }
  return k >= 0 ? c[k] : 0;  /* match the serial code when no element qualifies */
}

int main(void) {
  for(int i=0; i<ARRAY_DIM; i++) a[i] = 1;
  for(int i=0; i<ARRAY_DIM; i++) c[i] = i;
  printf("%f\n", foo1());
  printf("%f\n", foo2());
}

10 Comments

Thank you. The solutions you suggested for example 1 and foo2 require modifications to the original code. The deficiency of foo1 you pointed out yourself.
@DavidLivshin Your original text did not state that you required the code not to be modified. I provided an answer because I found a solution which does not require changing the code (foo1). That answers your question. I provided foo2 to show how to do it better. foo1 is good enough if a[i] > 0.01 is usually false. I don't know of better solutions which don't require changing the code.
@DavidLivshin I'm skeptical of your claim that "The two examples I listed can be treated without any effort by the autoparallelizer I wrote." I'm especially skeptical that your autoparallelizer would provide better performance in these cases.
The parallelizer I wrote ( dco ) treats the listed code samples with no effort; actually, it is able to parallelize much more complicated cases. See the appropriate section at www.dalsoft.com. The site will be updated soon; I am looking for interesting benchmarks to add to those already processed ( but not yet published on the site ) - that is why this posting.
@DavidLivshin, interesting project. How does dco handle case one? Do you see any performance improvement in either case?