
I have an application where the root rank is sending messages to all ranks in the following way:

tag = 22
if (myrankid == 0) then
  do i = 1, nproc
    if (i == 1) then
      do j = 1, nvert
        xyz((j-1)*3+1) = data((j-1)*3+1,1)
        xyz((j-1)*3+2) = data((j-1)*3+2,1)
        xyz((j-1)*3+3) = data((j-1)*3+3,1)
      enddo
    else
      call mpi_send(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, ierr)
    endif
  enddo
else
  call mpi_recv(data, glb_nvert(i)*3, mpi_real, 0, tag, comm, stat, ierr)
endif

My problem is that, only when running on more than 3000 ranks, this send/receive pair hangs at a certain MPI rank (in my specific application it is rank 2009).

I have checked that the sizes and arrays are consistent, and the only thing I found interesting was comm. It is a communicator that I duplicated from another MPI communicator.

When I print comm with print*, comm, every rank prints the same integer except the root, which prints a different one.

E.g.

The root prints:

-1006632941

while the remaining 2999 ranks print:

-1006632951

Is that really what is causing the problem?

I have tried both Intel MPI and Cray MPI.

  • A minimal reproducible example is likely to be needed. Commented Jul 17, 2021 at 11:00
  • The value of the communicator variable only has meaning locally. There is no reason whatsoever for it to be the same on all procs, and commonly is not. Commented Jul 17, 2021 at 11:05
  • What you are doing is effectively a MPI_BCAST. If you modify the code to use that does it work? Commented Jul 17, 2021 at 16:17
  • @IanBush, in reality data is loaded from a file using HDF5, specific to each rank in the loop. I tried changing comm to MPI_COMM_WORLD and the problem persisted, so my initial suspicion that comm was misbehaving is not correct. I can try removing HDF5 from my real application and just passing dummy arrays, to see whether the problem is still there and rule out the HDF5 library as the cause. But thanks for confirming that the communicator variable is only meaningful locally! Commented Jul 17, 2021 at 22:22
  • Without seeing exactly what you are doing in a minimal, complete, reproducible example, it's impossible to say more. Commented Jul 18, 2021 at 7:04
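
As a starting point for the minimal reproducible example requested in the comments, here is a hedged sketch of the same send/receive pattern. It is not the asker's real code: the program name, the buf array, the placeholder sizes in glb_nvert and the placeholder payload are all assumptions made for illustration, and the HDF5 loading is left out. One deliberate difference from the posted snippet: there the receive branch uses glb_nvert(i), although the loop variable i is only assigned on the root, so the sketch indexes the count with the receiver's own rank instead. Whether that is related to the hang cannot be told from the snippet alone. The sketch also prints the communicator handle, which, as noted in the comments, only has meaning locally.

```fortran
program per_rank_send
  use mpi
  implicit none
  integer, parameter :: tag = 22
  integer :: comm, ierr, myrankid, nproc, i, n
  integer :: stat(MPI_STATUS_SIZE)
  integer, allocatable :: glb_nvert(:)
  real, allocatable :: buf(:)

  call mpi_init(ierr)

  ! Duplicate a communicator, as described in the question.
  call mpi_comm_dup(MPI_COMM_WORLD, comm, ierr)
  call mpi_comm_rank(comm, myrankid, ierr)
  call mpi_comm_size(comm, nproc, ierr)

  ! Assumption: every rank knows the vertex count of every rank.
  ! The sizes below are placeholders, not values from the question.
  allocate(glb_nvert(nproc))
  do i = 1, nproc
    glb_nvert(i) = 10 + i
  enddo

  if (myrankid == 0) then
    ! The root keeps its own data and sends one message to each other rank.
    do i = 2, nproc
      n = glb_nvert(i)*3
      allocate(buf(n))
      buf = real(i)   ! placeholder payload standing in for the file data
      call mpi_send(buf, n, MPI_REAL, i-1, tag, comm, ierr)
      deallocate(buf)
    enddo
  else
    ! Index the count with this rank's own id (1-based), not with the
    ! root's loop variable, which is never assigned on this rank.
    n = glb_nvert(myrankid+1)*3
    allocate(buf(n))
    call mpi_recv(buf, n, MPI_REAL, 0, tag, comm, stat, ierr)
    deallocate(buf)
  endif

  ! The handle is a local value; it may legitimately differ between ranks.
  print *, 'rank', myrankid, 'comm handle', comm

  deallocate(glb_nvert)
  call mpi_comm_free(comm, ierr)
  call mpi_finalize(ierr)
end program per_rank_send
```

If a stripped-down program like this runs cleanly at the rank count where the real application hangs, that would point toward something outside the send/receive pair itself, such as the per-rank file loading mentioned in the comments.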
