
I have a particle simulation in C which is split over 4 MPI processes and running fast (compared to serial). However, one region of my implementation has O(N^2) complexity: I need to compare each particle against every other particle in that process, plus 'border' particles shared from other processes. My plan to speed it up was to parallelize the outer loop with #pragma omp parallel for, but every variation of OpenMP pragmas I have tried has resulted in a severe slowdown of my simulation, because the nested loop in question takes significantly longer.

I have tried using schedule and reduction clauses when starting the parallel region, which didn't do much. I have also tried 4 and 8 threads, which didn't help either, plus a range of system sizes (to check whether the speedup 'kicked in' once the overhead was worth it; it did not) and different compiler optimizations.
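
For illustration, the variants were along these lines (a sketch, not the exact pragmas I used):

// outer loop with an explicit schedule:
#pragma omp parallel for num_threads(4) schedule(dynamic) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p += 2) { /* ... */ }

// or the inner loop instead, with a reduction on the force accumulators:
#pragma omp parallel for private(distances, particle_two) reduction(+:force_x, force_y)
for (int i = 0; i < myNumParticles; i += 2) { /* ... */ }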

Timing reveals that one of my four MPI processes takes a very long time through this section, which slows down the whole simulation, since all processes have to wait at the end of each time step before moving on to the next. The work is distributed quite evenly, so no single process should take longer than the others.

Code snippet (roughly copied, but I have made some aspects pseudocode as their details are unimportant):

double calc_1 = omp_get_wtime();
#pragma omp parallel for num_threads(4) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p = p + 2){
    //Particle* particle_iter = &particles[0];
    // printf("p%d: I am thread %d\n", my_rank, omp_get_thread_num());
    // fflush(stdout);
    double force_x = 0;
    double force_y = 0;
    particle_one[0] = positions[p];
    particle_one[1] = positions[p+1];
    //#pragma omp parallel shared(force_x, force_y)
    // {
    if (particle_one is real){
        // for (int i = 0; i < (myNumParticles-8); i = i + 8){   
        //#pragma omp parallel for private(distances)
        for (int i = 0; i < myNumParticles; i += 2){
            if (i != p){
                // lj_counter2++;
                particle_two[0] = positions[i];
                particle_two[1] = positions[i+1];
                if (particle within distance cutoff){
                    force_function(); // modifies forces array
                    force_x += (forces[0]);
                    force_y += (forces[1]);
                } 
                distances[0] = 0;
                distances[1] = 0;
            }
        }
        // repeat comparison against border particles
        for (int j = 0; j < (num_particles_local); j += 2){
            particle_three[0] = myBorderParticles[j];
            particle_three[1] = myBorderParticles[j+1];
            if (particle is real){
                if (within distance cutoff) {
                    force_function(); // modifies forces array
                    force_x += (forces[0]);
                    force_y += (forces[1]);
                } 
                distances[0] = 0;
                distances[1] = 0;
            } else {
                // if particle with position {0, 0} found, you're at the end
                break;
            }
        }
    }
    accelerations[p] = force_x;
    accelerations[p+1] = force_y;
}
double calc_2 = omp_get_wtime();
calc_time += (calc_2-calc_1);

Timing results for a 3600-particle system:

  • MPI only:
    373.6 seconds (of which 332.47-349.08 seconds is spent in the above nested loop)

  • MPI plus OpenMP:
    1606.2 seconds (of which 583.09-1579.96 seconds is spent in the nested loop, depending on the process)

Further timing for some smaller systems (times in seconds):

Particles   MPI    MPI + OpenMP
225         27     30
400         37     33
625         60     101
900         99     293
1225        153    490
1600        239    735

I am running it across 4 nodes with 28+ cores each, and I'm not requesting an unreasonable amount of memory. It is submitted as a Slurm batch job and run with mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt.
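
Roughly, the submission looks like this (a simplified sketch; the node/task layout directives shown here are an assumption, and other resource requests are omitted):

#!/bin/bash
#SBATCH --nodes=4              # 4 nodes, 28+ cores each
#SBATCH --ntasks-per-node=1    # one MPI rank per node (assumed layout)
#SBATCH --cpus-per-task=4      # threads available to each rank

mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt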

  • There is almost certainly something wrong with the core binding of threads. 1. Try to include --cpus-per-task=… in your sbatch script. 2. Instead of an explicit num_threads() in the pragma, run OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK mpiexec -n $SLURM_NTASKS Commented Nov 3, 2024 at 23:24
  • Thanks @Homer512. Sorry, I did not include that detail, but my Slurm script does have cpus-per-task set to 4, and I have manually printed/checked that all the threads are created with printf("thread %d\n", omp_get_thread_num()); fflush(stdout); Commented Nov 4, 2024 at 0:50
  • There are some fishy things in your snippet: particle_two and particle_three are set but never used. They should at least be declared private. Setting accelerations[p+1] looks suspicious too. Are you sure you did not want to set accelerations[2*p] and accelerations[2*p+1]? Or accelerations[p] and accelerations[p+myNumParticles]? Commented Nov 4, 2024 at 4:45
  • As a side note, it is good practice to add a default(none) clause to each OpenMP section, since it can help prevent some bugs. Commented Nov 4, 2024 at 7:09
  • @GillesGouaillardet The for loop iterates over p with step 2, so accelerations[p+1] seems to be fine. Using a struct {x, y} would improve readability and avoid such confusion (see the sketch after these comments). Commented Nov 4, 2024 at 7:53
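
Putting the comments' suggestions together, a sketch of what the loop header could look like (illustrative only; variable names are taken from the snippet above, and the struct name is a placeholder):

// small struct for readability, e.g. positions/accelerations stored as arrays of these
typedef struct { double x, y; } Vec2;

#pragma omp parallel for num_threads(4) default(none) \
        shared(positions, myBorderParticles, accelerations, myNumParticles, num_particles_local) \
        private(particle_one, particle_two, particle_three, distances, forces)
for (int p = 0; p < myNumParticles; p += 2) {
    /* ... loop body as in the question ... */
}

Note that if force_function() writes into a global forces array internally, listing forces in a private clause here only privatizes the name within this construct; the called function would still touch the shared global unless the private copy is passed to it (or the array is made threadprivate).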
