code that may not be directly parallelized by OpenMP

I am working on an autoparallelizer and looking for benchmarks that may not be directly parallelized by OpenMP ( "directly parallelized" meaning: creating a parallel executable without modifying the code, just by specifying proper OpenMP directives ).

While on the subject, is it possible to parallelize the following code using OpenMP?

1.

double a[ARRAY_DIM], c[ARRAY_DIM];
  .
double ret;
ret = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
  c[i] = exp( ret );
  ret += a[i];
}
return ret;

2.

double a[ARRAY_DIM], c[ARRAY_DIM];
   .
double ret;
ret = 0.;
for ( i = 0; i < ARRAY_DIM; i++ ) {
  if ( a[i] > 0.01 ) {
    ret = c[i];
  }
}
return ret;
  • Yes, by adding the proper #pragma omp, you don't have to change your code. You can optimize further by changing your code, but (usually) adding the #pragma already gives an improvement. Commented Jul 30, 2018 at 15:03
  • Neither of your code examples is easy to parallelize with OpenMP. That does not mean it can't be done, but it can't be done with one pragma statement. The first case is a cumulative sum/prefix sum. The second case requires knowing the last iteration that modifies ret, i.e. the result is dependent on i. Commented Jul 31, 2018 at 7:29
  • Actually, the second case can be done with a single pragma using the ordered clause. Commented Jul 31, 2018 at 7:38
  • Actually two pragmas. Commented Jul 31, 2018 at 8:10
  • Why did you reinsert . into your code? Also, why do you insist on having a return when you did not even define a function? Commented Aug 3, 2018 at 9:22

1 Answer

Code example 1 can be parallelized with OpenMP using a prefix sum solution.

For code example 2 I have two solutions in the code below, called foo1 and foo2. The function foo1 is easier to implement but is probably much less efficient if a[i] > 0.01 is frequently true: for example, if it is always true, ret has to be written every iteration in order, which kills the parallelization. The function foo2 only has to write in order once per thread rather than per iteration, and since the number of iterations is much greater than the number of threads, this is much more efficient.

#include <omp.h>
#include <stdio.h>
#define ARRAY_DIM 1000
double a[ARRAY_DIM], c[ARRAY_DIM];

/* foo1: run the loop in parallel and serialize only the assignment
   to ret with an ordered region, so the last iteration with
   a[i] > 0.01 still wins. */
double foo1() {
  double ret = 0;
  #pragma omp parallel for ordered schedule(static)
  for (int i = 0; i < ARRAY_DIM; i++) {
    if ( a[i] > 0.01 ) {
      #pragma omp ordered
      ret = c[i];
    }
  }
  return ret;
}

/* foo2: each thread records the last qualifying index of its chunk
   in k2, then the per-thread results are combined in thread order
   so the globally last qualifying index ends up in k. */
double foo2() {
  int k = -1;
  #pragma omp parallel
  {
    int k2 = -1;
    #pragma omp for schedule(static)
    for (int i = 0; i < ARRAY_DIM; i++) if ( a[i] > 0.01 ) k2 = i;
    /* one ordered iteration per thread, in thread order */
    #pragma omp for schedule(static) ordered
    for (int i = 0; i < omp_get_num_threads(); i++)
      #pragma omp ordered
      if (k2 >= 0) k = k2;
  }
  return k >= 0 ? c[k] : 0;  /* match the serial code when no element qualifies */
}

int main(void) {
  for(int i=0; i<ARRAY_DIM; i++) a[i] = 1;
  for(int i=0; i<ARRAY_DIM; i++) c[i] = i;
  printf("%f\n", foo1());
  printf("%f\n", foo2());
}

10 Comments

Thank you. The solutions you suggested for example 1 and foo2 require modifications to the original code. The deficiency of foo1 you pointed out yourself.
@DavidLivshin Your original text did not state that you required the code not to be modified. I provided an answer because I found a solution which does not require changing the code (foo1). That answers your question. I provided foo2 to show how to do it better. foo1 is good enough if a[i] > 0.01 is usually false. I don't know of better solutions which don't require changing the code.
@DavidLivshin I'm skeptical of your claim that "The two examples I listed can be treated without any effort by the autoparallelizer I wrote." I'm especially skeptical that your autoparallelizer would provide better performance in these cases.
The parallelizer I wrote ( dco ) treats the listed code samples with no effort; actually, it is able to parallelize much more complicated cases. See the appropriate section at www.dalsoft.com. The site will be updated soon; I am looking for interesting benchmarks to add to those already processed ( but not yet published on the site ) - that is why this posting.
@DavidLivshin, interesting project. How does dco handle case one? Do you see any performance improvement in either case?