
I need to build a cluster using MPICH. First I tried these examples (http://mpitutorial.com/beginner-mpi-tutorial/) on a single machine and they worked as expected. Then I created a cluster according to this guide (https://help.ubuntu.com/community/MpichCluster) and ran the example below, which is given there, and it works.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
 int myrank, nprocs;

 MPI_Init(&argc, &argv);
 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
 MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

 printf("Hello from processor %d of %d\n", myrank, nprocs);

 MPI_Finalize();
 return 0;

}

mpiexec -n 8 -f machinefile ./mpi_hello

Next I ran this example (http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/), but this time I got the error below. Any idea what went wrong, and where?

    Fatal error in MPI_Allreduce: A process has failed, error stack:
    MPI_Allreduce(861)........: MPI_Allreduce(sbuf=0x7ffff0f55630, rbuf=0x7ffff0f55634, count=1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD) failed
    MPIR_Allreduce_impl(719)..:
    MPIR_Allreduce_intra(362).:
    dequeue_and_set_error(888): Communication error with rank 1

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   EXIT CODE: 1
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================
    [proxy:0:1@ce-412] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
    [proxy:0:1@ce-412] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
    [proxy:0:1@ce-412] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
    [mpiexec@ce-411] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
    [mpiexec@ce-411] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
    [mpiexec@ce-411] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
    [mpiexec@ce-411] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
  • According to the Communication error with rank 1 message, your master node with rank 0 cannot connect to the node with rank 1, so you should look in that direction. You can try a simple MPI_Send and MPI_Recv to ping node 1 from the root. Commented May 13, 2015 at 6:49
  • Those methods also did not work (mpitutorial.com/tutorials/mpi-send-and-receive) Commented May 13, 2015 at 7:03
  • Then it is definitely a networking setup error. Have you tried to check the following steps? Commented May 13, 2015 at 7:34

1 Answer


Yes, as @Alexey mentioned, it was indeed a network error. Here is what I did to get this working.

1) Exported the host file as HYDRA_HOST_FILE so that MPICH's Hydra process manager picks it up (for more information: https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager):

    export HYDRA_HOST_FILE=<path_to_host_file>/hosts
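For reference, a Hydra host file simply lists one machine per line, optionally with the number of processes to place there after a colon (hostnames below are placeholders):

```
node0:2
node1:2
```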

2) I had to apply the fix from this thread (http://lists.mpich.org/pipermail/discuss/2013-January/000285.html), i.e. pass this flag to mpiexec:

    -disable-hostname-propagation

Finally, here is the command that gives me a correct connection among the cluster nodes:

    mpiexec -launcher fork -disable-hostname-propagation -f machinefile -np 4 ./Test