Hi there!
I am following the tutorial on writing distributed applications.
Here is the basic example code. Everything works as long as I don't use CUDA.
Is there a specific step to enable it that I missed? I am getting a segfault…
It might also be linked to some kind of permission issue, since the signal code is "Invalid permissions".
Any help would be really appreciated!
import os
import torch
import torch.distributed as dist
import platform


def run(rank, size):
    tensor = torch.zeros(1).cuda(0)
    if rank == 0:
        # Send the tensor to process 1
        tensor += 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(fn):
    """ Initialize the distributed environment. """
    dist.init_process_group('mpi')
    rank = dist.get_rank()
    size = dist.get_world_size()
    print('I am rank ', rank, ' on ', platform.node())
    fn(rank, size)


if __name__ == "__main__":
    init_processes(run)
Here is the command and its output:
$ mpiexec -np 2 python main.py
WARNING: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to restrictive ptrace settings.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: iccluster131
--------------------------------------------------------------------------
I am rank 0 on iccluster131
I am rank 1 on iccluster131
[iccluster131:19580] *** Process received signal ***
[iccluster131:19580] Signal: Segmentation fault (11)
[iccluster131:19580] Signal code: Invalid permissions (2)
[iccluster131:19580] Failing at address: 0x420a900000
[iccluster131:19580] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fda4b7ae390]
[iccluster131:19580] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14db15)[0x7fda4ac08b15]
[iccluster131:19580] [ 2] /home/me/.conda/envs/pytorch-env/lib/./libopen-pal.so.40(opal_convertor_pack+0x175)[0x7fda1ae5e925]
[iccluster131:19580] [ 3] /home/me/.conda/envs/pytorch-env/lib/openmpi/mca_btl_vader.so(mca_btl_vader_sendi+0x383)[0x7fda10799803]
[iccluster131:19580] [ 4] /home/me/.conda/envs/pytorch-env/lib/openmpi/mca_pml_ob1.so(+0xb6db)[0x7fda0bde56db]
[iccluster131:19580] [ 5] /home/me/.conda/envs/pytorch-env/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x690)[0x7fda0bde7120]
[iccluster131:19580] [ 6] /home/me/.conda/envs/pytorch-env/lib/libmpi.so.40(PMPI_Send+0xf2)[0x7fda254bc882]
[iccluster131:19580] [ 7] /home/me/.conda/envs/pytorch-env/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_Z15THDPModule_sendP7_objectS0_+0xd5)[0x7fda42831e95]
[iccluster131:19580] [ 8] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x18e)[0x7fda4ba8660e]
[iccluster131:19580] [ 9] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x16669a)[0x7fda4bb2069a]
[iccluster131:19580] [10] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)[0x7fda4bb25046]
[iccluster131:19580] [11] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x16629e)[0x7fda4bb2029e]
[iccluster131:19580] [12] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x1665b2)[0x7fda4bb205b2]
[iccluster131:19580] [13] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x3bd8)[0x7fda4bb24a98]
[iccluster131:19580] [14] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x165930)[0x7fda4bb1f930]
[iccluster131:19580] [15] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x166854)[0x7fda4bb20854]
[iccluster131:19580] [16] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)[0x7fda4bb25046]
[iccluster131:19580] [17] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x165930)[0x7fda4bb1f930]
[iccluster131:19580] [18] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x166854)[0x7fda4bb20854]
[iccluster131:19580] [19] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x4186)[0x7fda4bb25046]
[iccluster131:19580] [20] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(+0x16629e)[0x7fda4bb2029e]
[iccluster131:19580] [21] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(PyEval_EvalCodeEx+0x6d)[0x7fda4bb208cd]
[iccluster131:19580] [22] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(PyEval_EvalCode+0x3b)[0x7fda4bb2091b]
[iccluster131:19580] [23] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(PyRun_FileExFlags+0xb2)[0x7fda4bb5b472]
[iccluster131:19580] [24] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(PyRun_SimpleFileExFlags+0xe7)[0x7fda4bb5b5d7]
[iccluster131:19580] [25] /home/me/.conda/envs/pytorch-env/bin/../lib/libpython3.6m.so.1.0(Py_Main+0xf2c)[0x7fda4bb766dc]
[iccluster131:19580] [26] python(main+0x16e)[0x400bce]
[iccluster131:19580] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fda4aadb830]
[iccluster131:19580] [28] python[0x400c95]
[iccluster131:19580] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node iccluster131 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[iccluster131:19575] 1 more process has sent help message help-btl-vader.txt / cma-permission-denied
[iccluster131:19575] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
My MPI version (the one PyTorch was compiled with):
$ mpiexec --version
mpiexec (OpenRTE) 3.0.0
Python and PyTorch versions:
$ python --version
Python 3.6.3 :: Anaconda, Inc.
$ python -c 'import torch;print(torch.__version__)'
0.4.0a0+0ab68b8
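One thing I could try, assuming the crash comes from handing a GPU pointer to an MPI build that is not CUDA-aware (I have not verified this; the trace only shows the fault inside opal_convertor_pack during PMPI_Send), is to stage the tensor through host memory. Below is a minimal sketch of that idea. The helper names send_via_cpu and recv_via_cpu are hypothetical, not part of torch.distributed.

```python
import torch
import torch.distributed as dist


def send_via_cpu(tensor, dst):
    """Hypothetical helper: copy the tensor to host memory and send that copy."""
    # .cpu() returns a host copy for a CUDA tensor (and the tensor itself if it
    # is already on the CPU), so MPI only ever sees a host pointer.
    dist.send(tensor=tensor.cpu(), dst=dst)


def recv_via_cpu(tensor, src):
    """Hypothetical helper: receive into a host buffer, then copy into `tensor`."""
    # Allocate a CPU buffer with the same shape and dtype as the destination.
    buffer = torch.empty(tensor.size(), dtype=tensor.dtype)
    dist.recv(tensor=buffer, src=src)
    # In-place copy; this also moves the data to the GPU if `tensor` lives there.
    tensor.copy_(buffer)
```

In run() above, this would mean replacing dist.send(tensor=tensor, dst=1) with send_via_cpu(tensor, dst=1) and dist.recv(tensor=tensor, src=0) with recv_via_cpu(tensor, src=0). If that variant runs cleanly, it would point at the MPI build rather than at PyTorch. As far as I know, Open MPI reports whether it was built with CUDA support via `ompi_info --parsable --all | grep mpi_built_with_cuda_support:value`.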