
I need to build a cluster using MPICH. First I tried these examples (http://mpitutorial.com/beginner-mpi-tutorial/) on a single machine and they worked as expected. Then I created a cluster according to this guide (https://help.ubuntu.com/community/MpichCluster) and ran the example below, which is given there, and it works.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
 int myrank, nprocs;

 MPI_Init(&argc, &argv);
 MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
 MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

 printf("Hello from processor %d of %d\n", myrank, nprocs);

 MPI_Finalize();
 return 0;

}

mpiexec -n 8 -f machinefile ./mpi_hello

Next I ran this example (http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/), but this time I got the error below. Any idea what went wrong, and where?

    Fatal error in MPI_Allreduce: A process has failed, error stack:
    MPI_Allreduce(861)........: MPI_Allreduce(sbuf=0x7ffff0f55630, rbuf=0x7ffff0f55634, count=1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD) failed
    MPIR_Allreduce_impl(719)..:
    MPIR_Allreduce_intra(362).:
    dequeue_and_set_error(888): Communication error with rank 1

    ===================================================================================
    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
    =   EXIT CODE: 1
    =   CLEANING UP REMAINING PROCESSES
    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
    ===================================================================================
    [proxy:0:1@ce-412] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
    [proxy:0:1@ce-412] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
    [proxy:0:1@ce-412] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
    [mpiexec@ce-411] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
    [mpiexec@ce-411] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
    [mpiexec@ce-411] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
    [mpiexec@ce-411] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
  • According to the Communication error with rank 1 message, your master node with rank 0 cannot connect to the node with rank 1, so you should look in that direction. You can try a simple MPI_Send and MPI_Recv to ping node 1 from the root. Commented May 13, 2015 at 6:49
  • Those methods also did not work (mpitutorial.com/tutorials/mpi-send-and-receive) Commented May 13, 2015 at 7:03
  • Then it is definitely a networking setup error. Have you tried to check the following steps? Commented May 13, 2015 at 7:34

1 Answer


Yes, as @Alexey mentioned, it was indeed a network error. Here is what I did to get this working.

1) Exported the host file as HYDRA_HOST_FILE so that MPICH's Hydra process manager picks it up (for more information: https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager):

    export HYDRA_HOST_FILE=<path_to_host_file>/hosts
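For reference, a Hydra host file simply lists one machine per line, optionally with the number of processes to place there after a colon (hostnames below are placeholders):

```
node0:2
node1:2
```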

2) I had to apply the fix from this thread (http://lists.mpich.org/pipermail/discuss/2013-January/000285.html), i.e. pass this flag to mpiexec:

    -disable-hostname-propagation

Finally, here is the command that gives me a correct connection among the cluster nodes:

    mpiexec -launcher fork -disable-hostname-propagation -f machinefile -np 4 ./Test