
I'm running this neat little gravity simulation, and in serial execution it takes a little more than 4 minutes. When I parallelize one loop inside it, the runtime increases to about 7 minutes, and if I try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations, but I think they don't matter. I'm posting the 7-minute version, however, with some comments on the loops where I wanted to add parallelization. Thank you for helping me with my messy code.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define numb 1000
int main(){
  double pos[numb][3],a[numb][3],a_local[3],v[numb][3];
  memset(v, 0, numb*3*sizeof(double));
  double richtung[3];
  double t,deltat=0.0,r12 = 0.0,endt=10.;
  unsigned seed;
  int tcount=0;
  #pragma omp parallel private(seed) shared(pos)
  {
    seed = 25235 + 16*omp_get_thread_num();
    #pragma omp for 
    for(int i=0;i<numb;i++){
      for(int j=0;j<3;j++){
        pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
      }
    }
  }
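  /* time-integration loop: half position step and velocity kick,
     recompute accelerations, then the second half position step */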
  for(t=0.;t<endt;t+=deltat){
    printf("\r%le", t);
    tcount++;
    #pragma omp parallel for shared(pos,v)
    for(int id=0; id<numb; id++){
      for(int l=0;l<3;l++){
        pos[id][l] = pos[id][l]+(0.5*deltat*v[id][l]);
        v[id][l] = v[id][l]+a[id][l]*(deltat);
      }
    }
    memset(a, 0, numb*3*sizeof(double));
    memset(a_local, 0, 3*sizeof(double));
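    /* accumulate the pairwise gravitational accelerations over all unordered pairs */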
    #pragma omp parallel for private(r12,richtung) shared(a,pos)
    for(int id=0; id <numb; ++id){
      for(int id2=0; id2<id; id2++){
        for(int k=0;k<3;k++){
          r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
        }
        for(int k=0; k<3;k++){
          richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
          a[id][k] += richtung[k]/(((r12)*(r12)));
          a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
          #pragma omp critical
          {
            a[id2][k] += a_local[k];
          }
        }
        r12=0.0;
      }
    }
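    /* second half of the position update */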
    #pragma omp parallel for shared(pos)
    for(int id =0; id<numb; id++){
      for(int k=0;k<3;k++){
        pos[id][k] = pos[id][k]+(0.5*deltat*v[id][k]);
      }
    }
    deltat= 0.01;
  }
  return 0;
}

I'm compiling with g++ -fopenmp -o test_grav test_grav.c and measuring time in the shell simply with time ./test_grav. When I used omp_get_num_threads() to get the number of threads, it reported 4. top also shows more than 300% (sometimes ~380%) CPU usage. An interesting little fact: if I start the parallel region before the time loop (the outermost for loop) and without any actual #pragma omp for, it is equivalent to making one parallel region for each of the three major loops (the second-to-outermost loops). So I think it is an optimization issue, but I don't know how to solve it. Can anyone help me?
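For reference, a minimal version of that check (omp_get_num_threads() reports the team size only when called inside a parallel region; outside it returns 1):

#include <stdio.h>
#include <omp.h>

int main(void){
  #pragma omp parallel
  {
    /* let a single thread report for the whole team */
    #pragma omp single
    printf("threads: %d\n", omp_get_num_threads());
  }
  return 0;
}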

Edit: I made the example verifiable and lowered numbers like numb to make it easier to test, but the problem still occurs. It does even when I remove the critical region, as suggested by TheQuantumPhysicist, just not as severely.

  • The critical section looks evil. Can't you just rerun the loop and keep that critical section outside with no parallelization? Commented Mar 6, 2017 at 9:28
  • The critical region solves a race condition on my acceleration a[id2][0,1,2], like a reduction onto an array. And I do need the id2 loop so that I'm writing the right forces into a. Commented Mar 6, 2017 at 9:43
  • Where is richtung defined? It makes a big difference if it's an array or a pointer. If it's an array then OpenMP will make private arrays for each thread (like you want). If it's a pointer then you only get a private pointer for each thread. Have you checked that the parallel version gets the same answer? You do ` r12=0.0;` at the end of your loop so the initial value of r12 is undefined for each thread. Commented Mar 6, 2017 at 9:45
  • How big is numb? You need to do enough work to overcome the OpenMP overhead. Commented Mar 6, 2017 at 9:47
  • @Haemiltoen I understand, but you probably didn't understand my alternative solution. Just create another loop outside that parallelized loop and don't use OpenMP there. Would that work for you? If it does, it'll definitely be much better than whatever you have there. Critical sections mean you're using locking and mutexes, which by definition will slow down your code. Commented Mar 6, 2017 at 9:52

1 Answer

I believe that critical section is the cause of the problem. Consider taking all critical sections outside the parallelized loop and running them after the parallelization is over.

Try this:

#pragma omp parallel shared(a,pos)
{
#pragma omp for private(r12,richtung,a_local)
for(int id=0; id<numb; ++id){
    for(int id2=0; id2<id; id2++){
        r12 = 0.0; /* the private copies start uninitialized, so reset per pair */
        for(int k=0; k<3; k++){
            r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
        }
        for(int k=0; k<3; k++){
            richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
            a[id][k]   += richtung[k]/((r12*r12)+epsilon); /* epsilon: small softening term, declared elsewhere */
            a_local[k] += richtung[k]/((r12*r12)+epsilon)*(-1.0);
        }
    }
}
}
for(int id=0; id<numb; ++id){
    for(int id2=0; id2<id; id2++){
        for(int k=0; k<3; k++){
            a[id2][k] += a_local[k];
        }
    }
}

Critical sections lead to locking and blocking. If you can keep those sections serial, outside the parallelized loop, you'll win a lot in performance.
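If the update of the shared a[id2][k] has to stay inside the parallel loop, a lighter-weight alternative to a full critical section is an atomic update on just that one store. A sketch using the names from the question's loop, with the division hoisted so the atomic protects a single update:

for(int k=0; k<3; k++){
    double contrib = richtung[k]/(r12*r12); /* pairwise force term */
    a[id][k] += contrib;    /* each thread owns its own id values: no race */
    #pragma omp atomic
    a[id2][k] -= contrib;   /* id2 ranges over all particles: shared, so atomic */
}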

Notice that I'm talking about a syntactic solution; I don't know whether it works for your case. But to be clear: if every point in your series depends on the one before it, then parallelizing is not a solution for you, at least not simple parallelization using OpenMP.
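Alternatively, since the comments mention "a reduction onto an array": OpenMP 4.5 (GCC 6 or newer) can reduce over a fixed-size array directly, which removes both the critical section and the second fix-up loop; each thread then accumulates into its own private copy of a (numb*3 doubles, 24 KB here) that is summed at the end. A sketch under that assumption, with schedule(dynamic) because the triangular loop gives early iterations far less work; note it also computes r12 as the Euclidean distance, whereas the question's code sums sqrt of each squared component, i.e. the absolute component differences:

#pragma omp parallel for schedule(dynamic) reduction(+:a)
for(int id=0; id<numb; ++id){
    for(int id2=0; id2<id; id2++){
        double r12 = 0.0, richtung[3]; /* per-pair locals, no private() clause needed */
        for(int k=0; k<3; k++)
            r12 += (pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]);
        r12 = sqrt(r12); /* Euclidean pair distance */
        for(int k=0; k<3; k++){
            richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
            a[id][k]  += richtung[k]/(r12*r12);
            a[id2][k] -= richtung[k]/(r12*r12); /* safe: a is thread-private under the reduction */
        }
    }
}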


6 Comments

Yes, this will solve the critical section problem; instead of a_local[k] I'll just have to use a[id][k]*(-1.0), and it'll mean that I run two loops with (1000^2 - 1000)/2 iterations instead of one.
@Haemiltoen Then this is definitely faster. Do it, and see what happens. Keep in mind the number of iterations doesn't really matter. What matters is what you do inside the loop.
@TheQuantumPhysicist Well, now it runs faster than with the critical region, but at 5 minutes it's still 1 minute slower than serial. And I can't parallelize the second loop or I'll get a race condition, which I would have to solve via some sort of critical region or a reduction onto an array, which should amount to the same thing. But thank you for your help; I think I learned a few things about OpenMP.
@Haemiltoen You're welcome. If this issue is closed, select the answer above through the check mark.
@Haemiltoen why did you remove the check mark? What's going on?
