
I'm running this neat little gravity simulation, and in serial execution it takes a little more than 4 minutes. When I parallelize one loop inside it, the runtime increases to about 7 minutes, and if I try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations, but I think they don't matter. I'm posting the 7-minute version, however, with some comments on the loops where I wanted to add parallelization. Thank you for helping me with my messy code.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define numb 1000
int main(){
  double pos[numb][3],a[numb][3],a_local[3],v[numb][3];
  memset(v, 0, numb*3*sizeof(double));
  double richtung[3];
  double t,deltat=0.0,r12 = 0.0,endt=10.;
  unsigned seed;
  int tcount=0;
  #pragma omp parallel private(seed) shared(pos)
  {
    seed = 25235 + 16*omp_get_thread_num();
    #pragma omp for 
    for(int i=0;i<numb;i++){
      for(int j=0;j<3;j++){
        pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
      }
    }
  }
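  /* time-integration loop: half position step and velocity kick,
     recompute accelerations, then the second half position step */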
  for(t=0.;t<endt;t+=deltat){
    printf("\r%le", t);
    tcount++;
    #pragma omp parallel for shared(pos,v)
    for(int id=0; id<numb; id++){
      for(int l=0;l<3;l++){
        pos[id][l] = pos[id][l]+(0.5*deltat*v[id][l]);
        v[id][l] = v[id][l]+a[id][l]*(deltat);
      }
    }
    memset(a, 0, numb*3*sizeof(double));
    memset(a_local, 0, 3*sizeof(double));
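    /* accumulate the pairwise gravitational accelerations over all unordered pairs */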
    #pragma omp parallel for private(r12,richtung) shared(a,pos)
    for(int id=0; id <numb; ++id){
      for(int id2=0; id2<id; id2++){
        for(int k=0;k<3;k++){
          r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
        }
        for(int k=0; k<3;k++){
          richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
          a[id][k] += richtung[k]/(((r12)*(r12)));
          a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
          #pragma omp critical
          {
            a[id2][k] += a_local[k];
          }
        }
        r12=0.0;
      }
    }
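    /* second half of the position update */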
    #pragma omp parallel for shared(pos)
    for(int id =0; id<numb; id++){
      for(int k=0;k<3;k++){
        pos[id][k] = pos[id][k]+(0.5*deltat*v[id][k]);
      }
    }
    deltat= 0.01;
  }
  return 0;
}

I'm compiling with g++ -fopenmp -o test_grav test_grav.c and measuring time in the shell simply with time ./test_grav. When I used omp_get_num_threads() to get the number of threads, it reported 4. top also shows more than 300% (sometimes ~380%) CPU usage. An interesting little fact: if I start the parallel region before the time loop (the outermost for loop) and without any actual #pragma omp for, it is equivalent to making one parallel region for each of the three major loops (the second-to-outermost loops). So I think it is an optimization issue, but I don't know how to solve it. Can anyone help me?
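For reference, a minimal version of that check (omp_get_num_threads() reports the team size only when called inside a parallel region; outside it returns 1):

#include <stdio.h>
#include <omp.h>

int main(void){
  #pragma omp parallel
  {
    /* let a single thread report for the whole team */
    #pragma omp single
    printf("threads: %d\n", omp_get_num_threads());
  }
  return 0;
}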

Edit: I made the example verifiable and lowered numbers like numb to make it easier to test, but the problem still occurs. It does even when I remove the critical region, as suggested by TheQuantumPhysicist, just not as severely.

  • The critical section looks evil. Can't you just rerun the loop and keep that critical section outside with no parallelization? Commented Mar 6, 2017 at 9:28
  • The critical region solves a race condition on my acceleration a[id2][0,1,2], like a reduction onto an array. And I do need the id2 loop so that I'm writing the right forces into a. Commented Mar 6, 2017 at 9:43
  • Where is richtung defined? It makes a big difference if it's an array or a pointer. If it's an array then OpenMP will make private arrays for each thread (like you want). If it's a pointer then you only get a private pointer for each thread. Have you checked that the parallel version gets the same answer? You do ` r12=0.0;` at the end of your loop so the initial value of r12 is undefined for each thread. Commented Mar 6, 2017 at 9:45
  • How big is numb? You need to do enough work to overcome the OpenMP overhead. Commented Mar 6, 2017 at 9:47
  • @Haemiltoen I understand, but you probably didn't understand my alternative solution. Just create another loop outside that parallelized loop and don't use OpenMP there. Would that work for you? If it does, it'll definitely be much better than whatever you have there. Critical sections mean you're using locking and mutexes, which by definition will slow down your code. Commented Mar 6, 2017 at 9:52

1 Answer

I believe that critical section is the cause of the problem. Consider taking all critical sections outside the parallelized loop and running them after the parallelization is over.

Try this:

#pragma omp parallel shared(a,pos)
{
#pragma omp for private(r12,richtung,a_local)
for(int id=0; id<numb; ++id){
    for(int id2=0; id2<id; id2++){
        r12 = 0.0; /* the private copies start uninitialized, so reset per pair */
        for(int k=0; k<3; k++){
            r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
        }
        for(int k=0; k<3; k++){
            richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
            a[id][k]   += richtung[k]/((r12*r12)+epsilon); /* epsilon: small softening term, declared elsewhere */
            a_local[k] += richtung[k]/((r12*r12)+epsilon)*(-1.0);
        }
    }
}
}
for(int id=0; id<numb; ++id){
    for(int id2=0; id2<id; id2++){
        for(int k=0; k<3; k++){
            a[id2][k] += a_local[k];
        }
    }
}

Critical sections lead to locking and blocking. If you can keep those sections serial, outside the parallelized loop, you'll win a lot in performance.
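If the update of the shared a[id2][k] has to stay inside the parallel loop, a lighter-weight alternative to a full critical section is an atomic update on just that one store. A sketch using the names from the question's loop, with the division hoisted so the atomic protects a single update:

for(int k=0; k<3; k++){
    double contrib = richtung[k]/(r12*r12); /* pairwise force term */
    a[id][k] += contrib;    /* each thread owns its own id values: no race */
    #pragma omp atomic
    a[id2][k] -= contrib;   /* id2 ranges over all particles: shared, so atomic */
}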

Notice that I'm talking about a syntactic solution; I don't know whether it works for your case. But to be clear: if every point in your series depends on the one before it, then parallelizing is not a solution for you, at least not simple parallelization using OpenMP.
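Alternatively, since the comments mention "a reduction onto an array": OpenMP 4.5 (GCC 6 or newer) can reduce over a fixed-size array directly, which removes both the critical section and the second fix-up loop; each thread then accumulates into its own private copy of a (numb*3 doubles, 24 KB here) that is summed at the end. A sketch under that assumption, with schedule(dynamic) because the triangular loop gives early iterations far less work; note it also computes r12 as the Euclidean distance, whereas the question's code sums sqrt of each squared component, i.e. the absolute component differences:

#pragma omp parallel for schedule(dynamic) reduction(+:a)
for(int id=0; id<numb; ++id){
    for(int id2=0; id2<id; id2++){
        double r12 = 0.0, richtung[3]; /* per-pair locals, no private() clause needed */
        for(int k=0; k<3; k++)
            r12 += (pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]);
        r12 = sqrt(r12); /* Euclidean pair distance */
        for(int k=0; k<3; k++){
            richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
            a[id][k]  += richtung[k]/(r12*r12);
            a[id2][k] -= richtung[k]/(r12*r12); /* safe: a is thread-private under the reduction */
        }
    }
}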


6 Comments

Yes, this will solve the critical section problem; instead of a_local[k] I'll just have to use a[id][k]*(-1.0), and it'll mean that I run two loops with (1000^2 - 1000)/2 iterations instead of one.
@Haemiltoen Then this is definitely faster. Do it, and see what happens. Keep in mind the number of iterations doesn't really matter. What matters is what you do inside the loop.
@TheQuantumPhysicist Well, now it runs faster than with the critical region, but at 5 minutes it's still 1 minute slower than serial. And I can't parallelize the second loop or I'll get a race condition, which I would have to solve via some sort of critical region or a reduction onto an array, which should amount to the same thing. But thank you for your help; I think I learned a few things about OpenMP.
@Haemiltoen You're welcome. If this issue is closed, select the answer above through the check mark.
@Haemiltoen why did you remove the check mark? What's going on?
