
I'm running the following code, which sends an array from rank 0 to rank 1, with the command mpirun -n 2 python -u test_irecv.py > output 2>&1.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
asyncr = 1  # 1: use isend/irecv, 0: use blocking send/recv
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    if asyncr: comm.isend(arrs, dest=1).wait()
    else: comm.send(arrs, dest=1)
else:
    if asyncr: arrv = comm.irecv(source=0).wait()
    else: arrv = comm.recv(source=0)

print('Done!', comm.Get_rank())

Running in synchronous mode with asyncr = 0 gives the expected output

Done! 0
Done! 1

However, running in asynchronous mode with asyncr = 1 gives the errors below. Why does it run fine in synchronous mode but fail in asynchronous mode?

Output with asyncr = 1:

Done! 0
[nia1477:420871:0:420871] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000643d1 ompi_errhandler_request_invoke()  ???:0
 2 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 3 0x000000000008a8b5 __pyx_f_6mpi4py_3MPI_PyMPI_wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:49819
 4 0x000000000008a8b5 __pyx_pf_6mpi4py_3MPI_7Request_34wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83838
 5 0x000000000008a8b5 __pyx_pw_6mpi4py_3MPI_7Request_35wait()  /tmp/eb-A2FAdY/pip-req-build-dvnprmat/src/mpi4py.MPI.c:83813
 6 0x00000000000966a3 _PyMethodDef_RawFastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/call.c:690
 7 0x000000000009eeb9 _PyMethodDescr_FastCallKeywords()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Objects/descrobject.c:288
 8 0x000000000006e611 call_function()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:4563
 9 0x000000000006e611 _PyEval_EvalFrameDefault()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3103
10 0x0000000000177644 _PyEval_EvalCodeWithName()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3923
11 0x000000000017774e PyEval_EvalCodeEx()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:3952
12 0x000000000017777b PyEval_EvalCode()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/ceval.c:524
13 0x00000000001aab72 run_mod()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:1035
14 0x00000000001aab72 PyRun_FileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:988
15 0x00000000001aace6 PyRun_SimpleFileExFlags()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Python/pythonrun.c:430
16 0x00000000001cad47 pymain_run_file()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:425
17 0x00000000001cad47 pymain_run_filename()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:1520
18 0x00000000001cad47 pymain_run_python()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2520
19 0x00000000001cad47 pymain_main()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2662
20 0x00000000001cb1ca _Py_UnixMain()  /dev/shm/mboisson/avx2/Python/3.7.0/dummy-dummy/Python-3.7.0/Modules/main.c:2697
21 0x00000000000202e0 __libc_start_main()  ???:0
22 0x00000000004006ba _start()  /tmp/nix-build-glibc-2.24.drv-0/glibc-2.24/csu/../sysdeps/x86_64/start.S:120
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 420871 on node nia1477 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The versions are as follows:

  • Python: 3.7.0
  • mpi4py: 3.0.0
  • mpiexec --version gives mpiexec (OpenRTE) 3.1.2
  • mpicc -v gives icc version 18.0.3 (gcc version 7.3.0 compatibility)

Running with asyncr = 1 on another system with MPICH gave the following output.

Done! 0
Traceback (most recent call last):
  File "test_irecv.py", line 14, in <module>
    if asyncr: arrv = comm.irecv(source=0).wait()
  File "mpi4py/MPI/Request.pyx", line 235, in mpi4py.MPI.Request.wait
  File "mpi4py/MPI/msgpickle.pxi", line 411, in mpi4py.MPI.PyMPI_wait
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23830,1],1]
  Exit code:    1
--------------------------------------------------------------------------
[master:01977] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[master:01977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

1 Answer

Apparently this is a known problem in mpi4py, described in https://bitbucket.org/mpi4py/mpi4py/issues/65/mpi_err_truncate-message-truncated-when. Lisandro Dalcin says:

The implementation of irecv() for large messages requires users to pass a buffer-like object large enough to receive the pickled stream. This is not documented (as most of mpi4py), and even non-obvious and unpythonic...

The fix is to pass a sufficiently large pre-allocated bytearray to irecv as the receive buffer. A working example follows.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank()==0:
    arrs = np.zeros(size_arr)
    comm.isend(arrs, dest=1).wait()
else:
    arrv = comm.irecv(bytearray(1<<20), source=0).wait()  # 1 MiB buffer for the pickled message

print('Done!', comm.Get_rank())
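
As a side note (this is an addition, not part of the original answer): when the payload is a NumPy array, you can also sidestep pickling entirely by using mpi4py's buffer-based, uppercase Isend/Irecv, where the receiver preallocates an array of the agreed shape and dtype. A minimal sketch, assuming both ranks know size_arr and that the data is float64:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size_arr = 10000

if comm.Get_rank() == 0:
    arrs = np.zeros(size_arr)                        # contiguous float64 send buffer
    comm.Isend([arrs, MPI.DOUBLE], dest=1).Wait()
else:
    arrv = np.empty(size_arr)                        # preallocated receive buffer
    comm.Irecv([arrv, MPI.DOUBLE], source=0).Wait()

print('Done!', comm.Get_rank())

With the buffer interface no pickle stream is created, so there is no hidden buffer-size requirement; the trade-off is that the receiver must already know the message size and dtype. If you stay with the lowercase irecv, the bytearray just has to be at least as large as the pickled message (roughly 80 kB for 10000 float64 values, so the 1 MiB buffer above is ample).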