I have an application where the root rank is sending messages to all ranks in the following way:
    tag = 22
    if (myrankid == 0) then
        do i = 1, nproc
            if (i == 1) then
                do j = 1, nvert
                    xyz((j-1)*3+1) = data((j-1)*3+1,1)
                    xyz((j-1)*3+2) = data((j-1)*3+2,1)
                    xyz((j-1)*3+3) = data((j-1)*3+3,1)
                enddo
            else
                call mpi_send(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, ierr)
            endif
        enddo
    else
        call mpi_recv(data, glb_nvert(i)*3, mpi_real, 0, tag, comm, stat, ierr)
    endif
My problem is that, only when running on more than 3000 ranks, this send/receive pair hangs at a certain MPI rank (in my application it is rank 2009).
I have checked that the sizes and arrays are consistent, and the only interesting thing I found was comm. comm is a communicator that I duplicated from another MPI communicator.
When I print comm with print*, comm, every rank except the root prints the same integer, while the root prints a different one.
E.g.
The root prints:
-1006632941
while the remaining 2999 ranks print:
-1006632951
Is that really what is causing the problem?
I have tried both Intel MPI and Cray MPI.
MPI_BCAST. If you modify the code to use that, does it work?

data is loaded from a file using hdf5, specific to each rank in the loop. I tried changing comm to MPI_COMM_WORLD, and the problem persisted, so my initial suspicion that comm was misbehaving is not correct. I can try to remove hdf5 from my real application and just pass dummy arrays, to rule out whether the problem is somehow related to the hdf5 library. But thanks for confirming that the communicator variable is locally set!
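
For reference, here is a minimal sketch of the dummy-array test mentioned above, with hdf5 taken out entirely. It is not my real application: the program name, buf, cmp and the vertex counts filled into glb_nvert are made up for illustration, and I am assuming glb_nvert is known on every rank. Two things differ from the snippet in the question. The receive count is indexed by the receiver's own rank (myrankid+1), because in the snippet as posted i is only assigned inside the root's loop. The duplicated communicator is also checked with mpi_comm_compare instead of by looking at the printed handle value.

```fortran
program send_recv_repro
  use mpi
  implicit none
  integer, parameter :: tag = 22
  integer :: ierr, myrankid, nproc, i, comm, cmp, n
  integer :: stat(MPI_STATUS_SIZE)
  integer, allocatable :: glb_nvert(:)
  real, allocatable :: buf(:)

  call mpi_init(ierr)
  ! Duplicate the parent communicator (assumed here to be MPI_COMM_WORLD).
  call mpi_comm_dup(MPI_COMM_WORLD, comm, ierr)
  call mpi_comm_rank(comm, myrankid, ierr)
  call mpi_comm_size(comm, nproc, ierr)

  ! A duplicate should compare as MPI_CONGRUENT to its parent on every rank,
  ! whatever integer the handle happens to print as.
  call mpi_comm_compare(MPI_COMM_WORLD, comm, cmp, ierr)
  if (cmp /= MPI_CONGRUENT) print *, 'rank', myrankid, 'unexpected comm'

  ! Hypothetical per-rank vertex counts, identical on all ranks.
  allocate(glb_nvert(nproc))
  do i = 1, nproc
    glb_nvert(i) = 10 + mod(i, 7)
  enddo
  allocate(buf(3*maxval(glb_nvert)))

  if (myrankid == 0) then
    do i = 2, nproc
      ! Dummy payload standing in for the data read from hdf5.
      buf(1:3*glb_nvert(i)) = real(i-1)
      call mpi_send(buf, glb_nvert(i)*3, mpi_real, i-1, tag, comm, ierr)
    enddo
  else
    ! Count is indexed by this rank, not by the root's loop variable.
    n = 3*glb_nvert(myrankid+1)
    call mpi_recv(buf, n, mpi_real, 0, tag, comm, stat, ierr)
    if (any(buf(1:n) /= real(myrankid))) print *, 'rank', myrankid, 'bad data'
  endif

  call mpi_barrier(comm, ierr)
  if (myrankid == 0) print *, 'all sends completed'
  call mpi_finalize(ierr)
end program send_recv_repro
```

If this runs cleanly at the rank count where the real application hangs, that would point toward the hdf5 reading or the real glb_nvert values rather than the send/receive pattern itself. If it also hangs, hdf5 can be ruled out.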