
I have a particle simulation in C which is split over 4 MPI processes and running fast (compared to serial). However, one region of my implementation has O(N^2) complexity: I need to compare each particle against every other particle in that process, plus 'border' particles shared from other processes. My plan to speed it up was to parallelize the outer loop with #pragma omp parallel for, but every variation of OpenMP pragmas I have tried has resulted in a severe slowdown of my simulation, because the nested loop in question takes significantly longer.

I have tried using schedule and reduction clauses when starting the parallel region, which didn't do much. I have also tried 4 and 8 threads, which didn't help either, plus a range of system sizes (to check whether the speedup 'kicked in' once the overhead was worth it; it did not) and different compiler optimizations.
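
For illustration, the variants were along these lines (a sketch, not the exact pragmas I used):

// outer loop with an explicit schedule:
#pragma omp parallel for num_threads(4) schedule(dynamic) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p += 2) { /* ... */ }

// or the inner loop instead, with a reduction on the force accumulators:
#pragma omp parallel for private(distances, particle_two) reduction(+:force_x, force_y)
for (int i = 0; i < myNumParticles; i += 2) { /* ... */ }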

Timing reveals that one of my four MPI processes takes a very long time through this section, which slows down the whole simulation, since all processes have to wait at the end of each time step before moving on to the next. The work is distributed quite evenly, so no single process should take longer than the others.

Code snippet (roughly copied, but I have made some aspects pseudocode as their details are unimportant):

double calc_1 = omp_get_wtime();
#pragma omp parallel for num_threads(4) private(distances, particle_one)
for (int p = 0; p < myNumParticles; p = p + 2){
    //Particle* particle_iter = &particles[0];
    // printf("p%d: I am thread %d\n", my_rank, omp_get_thread_num());
    // fflush(stdout);
    double force_x = 0;
    double force_y = 0;
    particle_one[0] = positions[p];
    particle_one[1] = positions[p+1];
    //#pragma omp parallel shared(force_x, force_y)
    // {
    if (particle_one is real){
        // for (int i = 0; i < (myNumParticles-8); i = i + 8){   
        //#pragma omp parallel for private(distances)
        for (int i = 0; i < myNumParticles; i += 2){
            if (i != p){
                // lj_counter2++;
                particle_two[0] = positions[i];
                particle_two[1] = positions[i+1];
                if (particle within distance cutoff){
                    force_function(); // modifies forces array
                    force_x += (forces[0]);
                    force_y += (forces[1]);
                } 
                distances[0] = 0;
                distances[1] = 0;
            }
        }
        // repeat comparison against border particles
        for (int j = 0; j < (num_particles_local); j += 2){
            particle_three[0] = myBorderParticles[j];
            particle_three[1] = myBorderParticles[j+1];
            if (particle is real){
                if (within distance cutoff) {
                    force_function(); // modifies forces array
                    force_x += (forces[0]);
                    force_y += (forces[1]);
                } 
                distances[0] = 0;
                distances[1] = 0;
            } else {
                // if particle with position {0, 0} found, you're at the end
                break;
            }
        }
    }
    accelerations[p] = force_x;
    accelerations[p+1] = force_y;
}
double calc_2 = omp_get_wtime();
calc_time += (calc_2-calc_1);

Timing results for a 3600-particle system:

  • MPI only:
    373.6 seconds (of which 332.47-349.08 seconds is spent in the above nested loop)

  • MPI plus OpenMP:
    1606.2 seconds (of which 583.09-1579.96 seconds is spent in the nested loop, depending on the process)

Further timing for some smaller systems (times in seconds):

Particles   MPI    MPI + OpenMP
225         27     30
400         37     33
625         60     101
900         99     293
1225        153    490
1600        239    735

I am running it across 4 nodes with 28+ cores each, and I'm not requesting an unreasonable amount of memory. It is submitted as a Slurm batch job and run with mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt.
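
Roughly, the submission looks like this (a simplified sketch; the node/task layout directives shown here are an assumption, and other resource requests are omitted):

#!/bin/bash
#SBATCH --nodes=4              # 4 nodes, 28+ cores each
#SBATCH --ntasks-per-node=1    # one MPI rank per node (assumed layout)
#SBATCH --cpus-per-task=4      # threads available to each rank

mpiexec -n 4 ./executable num_particles box_size > slurms/output.txt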

  • There is almost certainly something wrong with the core binding of threads. 1. Try to include --cpus-per-task=… in your sbatch script. 2. Instead of an explicit num_threads() in the pragma, run OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK mpiexec -n $SLURM_NTASKS Commented Nov 3, 2024 at 23:24
  • Thanks @Homer512. Sorry, I did not include that detail, but my Slurm script does have cpus-per-task set to 4, and I have manually printed/checked that all the threads are created with printf("thread %d\n", omp_get_thread_num()); fflush(stdout); Commented Nov 4, 2024 at 0:50
  • There are some fishy things in your snippet: particle_two and particle_three are set but never used. They should at least be declared private. Setting accelerations[p+1] looks suspicious too. Are you sure you did not want to set accelerations[2*p] and accelerations[2*p+1]? Or accelerations[p] and accelerations[p+myNumParticles]? Commented Nov 4, 2024 at 4:45
  • As a side note, it is good practice to add a default(none) clause to each OpenMP section, since it can help prevent some bugs. Commented Nov 4, 2024 at 7:09
  • @GillesGouaillardet The for loop iterates over p with step 2, so accelerations[p+1] seems to be fine. Using a struct {x, y} would improve readability and avoid such confusion (see the sketch after these comments). Commented Nov 4, 2024 at 7:53
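
Putting the comments' suggestions together, a sketch of what the loop header could look like (illustrative only; variable names are taken from the snippet above, and the struct name is a placeholder):

// small struct for readability, e.g. positions/accelerations stored as arrays of these
typedef struct { double x, y; } Vec2;

#pragma omp parallel for num_threads(4) default(none) \
        shared(positions, myBorderParticles, accelerations, myNumParticles, num_particles_local) \
        private(particle_one, particle_two, particle_three, distances, forces)
for (int p = 0; p < myNumParticles; p += 2) {
    /* ... loop body as in the question ... */
}

Note that if force_function() writes into a global forces array internally, listing forces in a private clause here only privatizes the name within this construct; the called function would still touch the shared global unless the private copy is passed to it (or the array is made threadprivate).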
