I developed a code parallelized in a hybrid way with OpenMPI + OpenMP. It works as I expect if 'enough' MPI processes are given. Based on my tests so far, 'enough' roughly means more than two MPI processes.
The problem I observe is that if the code is allocated only one or two MPI processes, the multi-threading via OpenMP does not work as expected: CPU usage is stuck at 200% (i.e., only 2 threads are used), not more. It is very unclear to me why this happens.
Here is information about the running environment:
Ubuntu 20.04.4 LTS, gfortran 13.2.0, Open MPI 4.1.5
To make the issue reproducible, here is a toy code that replicates it:
program parallel_example
  use OMP_LIB
  implicit none
  include 'mpif.h'
  integer :: i, j, k, n, ierror, size_Of_Cluster, process_Rank
  real :: sum, x

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)

  call omp_set_dynamic(.False.)
  call omp_set_num_threads(5)

  !$OMP PARALLEL
  print *, 'hello from thread:', OMP_GET_THREAD_NUM(), &
         & 'of proc=', process_Rank
  !$OMP END PARALLEL

  ! Set the number of iterations
  n = 100000

  ! Initialize the sum
  sum = 0.0

  !$omp parallel do collapse(3) default(none) private(i, j, k, x) shared(sum, n)
  do i = 1, n
    do j = 1, n
      do k = 1, n
        !print *, 'hello from thread:', OMP_GET_THREAD_NUM(), i, j, k
        x = 1.0 / (real(i) + real(j) + real(k))
        !$omp atomic
        sum = sum + x
      enddo
    enddo
  end do
  !$omp end parallel do

  print *, "The sum is: ", sum

  call MPI_Finalize(ierror)
end program parallel_example
The number of threads per MPI process is set to 5, so I expect to see 500% CPU usage for each MPI process in 'top' on my Ubuntu.
Here are the compile and execution commands:
mpif90 -fopenmp test.F90 -o app.exe
mpirun -np 1 ./app.exe
If I use 'mpirun -np 1' or 'mpirun -np 2', the CPU usage is stuck at 200%, no more. But if I give more than 2, for example 'mpirun -np 3', I can finally see 500% CPU usage for each of those three MPI processes.
It is very unclear to me why I cannot get 500% CPU usage with one or two MPI processes. I am fairly sure I am missing something in setting up the environment properly, but I really don't know what is wrong. If anyone has knowledge of this, please consider sharing it with me.
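For reference, the actual CPU binding of each rank can be inspected with Open MPI's `--report-bindings` option; this makes it visible whether a rank is confined to a single core (the likely cause of the 200% ceiling, since by default Open MPI binds to core for small process counts):

```shell
# Print each rank's core binding to stderr before the program runs.
# With the default policy, -np 1 or -np 2 is typically bound to a
# single core, while -np 3 and above is bound more widely.
mpirun --report-bindings -np 1 ./app.exe
mpirun --report-bindings -np 3 ./app.exe
```

These commands assume Open MPI's `mpirun`; the equivalent option differs in other MPI implementations.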
- Update: The answer has been identified: adding "--map-by node:pe=N" resolved the issue, so even with one MPI process, five threads can be used. That option appears to be an Open MPI convention for reserving N processing elements (cores) per process, though I do not fully understand it yet. To help other people who may run into similar issues, I hope "--map-by node:pe=N" can resolve them too. Thank you for the comments!
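Concretely, for the 5-thread toy code above the fixed invocation would look like this (assuming Open MPI's `mpirun`; `--bind-to none` is an alternative that simply removes the binding constraint rather than reserving cores):

```shell
# Reserve 5 processing elements (cores) per MPI rank so each of the
# 5 OpenMP threads can run on its own core:
mpirun --map-by node:pe=5 -np 1 ./app.exe

# Alternative: disable core binding entirely and let the OS
# scheduler place the threads:
mpirun --bind-to none -np 1 ./app.exe
```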
OMP_PROC_BIND=core or similar? I do not recommend using omp_set_num_threads, and I do not recommend using include 'mpif.h'.
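A minimal sketch of what the commenter's suggestions would look like applied to the toy code, assuming a compiler and MPI library that provide the `mpi` module (the thread count then comes from the environment, e.g. `export OMP_NUM_THREADS=5` and `export OMP_PROC_BIND=close`, instead of being hard-coded):

```fortran
program parallel_example
  use mpi        ! preferred over include 'mpif.h': adds compile-time argument checking
  use omp_lib
  implicit none
  integer :: ierror, size_Of_Cluster, process_Rank

  call MPI_Init(ierror)
  call MPI_Comm_size(MPI_COMM_WORLD, size_Of_Cluster, ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, process_Rank, ierror)

  ! No omp_set_num_threads here: thread count and placement are
  ! controlled via OMP_NUM_THREADS and OMP_PROC_BIND at run time.
  !$omp parallel
  print *, 'hello from thread:', omp_get_thread_num(), 'of proc=', process_Rank
  !$omp end parallel

  call MPI_Finalize(ierror)
end program parallel_example
```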