Distributed PyTorch with MPI

ph0123 · April 16, 2020, 8:58pm

Hi all,

I am trying to run PyTorch on a distributed system.
I run test1.py, shown below:

import torch
import torch.distributed as dist

def main(rank, world):
    if rank == 0:
        x = torch.tensor([1., -1.])  # Tensor of interest
        dist.send(x, dst=1)
        print('Rank-0 has sent the following tensor to Rank-1')
        print(x)
    else:
        z = torch.tensor([0., 0.])  # A holder for receiving the tensor
        dist.recv(z, src=0)
        print('Rank-1 has received the following tensor from Rank-0')
        print(z)

if __name__ == '__main__':
    dist.init_process_group(backend='mpi')
    main(dist.get_rank(), dist.get_world_size())

Then I ran it on a single machine:

mpiexec -n 2 python test1.py

It fails with the following error:

Traceback (most recent call last):
  File "test1.py", line 17, in <module>
    dist.init_process_group(backend='mpi')
  File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 392, in init_process_group
    timeout=timeout)
  File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 452, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have MPI built in")
RuntimeError: Distributed package doesn't have MPI built in
Traceback (most recent call last):
  File "test1.py", line 17, in <module>
    dist.init_process_group(backend='mpi')
  File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 392, in init_process_group
    timeout=timeout)
  File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 452, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have MPI built in")
RuntimeError: Distributed package doesn't have MPI built in
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31741,1],1]
  Exit code:    1
--------------------------------------------------------------------------

I installed PyTorch with:

pip install torch torchvision

Please help me.
Thanks,

pritamdamania87 · April 17, 2020, 12:47am

You need to build pytorch from source to enable MPI: https://pytorch.org/docs/stable/distributed.html#backends-that-come-with-pytorch
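
Before and after rebuilding, it may help to confirm what the installed build actually supports. The following is a minimal sketch, assuming a PyTorch version that exposes `torch.distributed.is_available()` and `torch.distributed.is_mpi_available()`; the helper name `mpi_backend_built` is only for illustration.

```python
import importlib.util


def mpi_backend_built():
    """Return True only if the installed torch was compiled with MPI support."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch is not installed in this environment
    import torch.distributed as dist

    # is_available() is False when torch was built without distributed support.
    return dist.is_available() and dist.is_mpi_available()


if __name__ == "__main__":
    print("MPI backend built in:", mpi_backend_built())
```

If this reports False, `init_process_group(backend='mpi')` would be expected to raise the "Distributed package doesn't have MPI built in" error shown above.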

ph0123 · April 17, 2020, 10:51am

Hi,

I followed the instructions on GitHub.

Clone the code from GitHub:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

Install the dependencies:

pip install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi

Run setup and install. This is the step that failed:
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/…/"}
python setup.py install

Building wheel torch-1.6.0a0+32bbf12
-- Building version 1.6.0a0+32bbf12
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/cluster/home/cnphuong/pytorch/torch -DCMAKE_PREFIX_PATH=/cluster/software/Anaconda3/2019.03/bin/../ -DNUMPY_INCLUDE_DIR=/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/cluster/home/cnphuong/my_environment/bin/python -DPYTHON_INCLUDE_DIR=/cluster/software/Python/3.6.6-foss-2018b/include/python3.6m -DPYTHON_LIBRARY=/cluster/software/Python/3.6.6-foss-2018b/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.6.0a0+32bbf12 -DUSE_NUMPY=True /cluster/home/cnphuong/pytorch
CMake Error: The source directory "/cluster/home/cnphuong/pytorch" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
Traceback (most recent call last):
  File "setup.py", line 738, in <module>
    build_deps()
  File "setup.py", line 320, in build_deps
    cmake=cmake)
  File "/cluster/home/cnphuong/pytorch/tools/build_pytorch_libs.py", line 59, in build_caffe2
    rerun_cmake)
  File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 324, in generate
    self.run(args, env=my_env)
  File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 141, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/cluster/software/Python/3.6.6-foss-2018b/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '-GNinja', '-DBUILD_PYTHON=True', '-DBUILD_TEST=True', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/cluster/home/cnphuong/pytorch/torch', '-DCMAKE_PREFIX_PATH=/cluster/software/Anaconda3/2019.03/bin/../', '-DNUMPY_INCLUDE_DIR=/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/numpy/core/include', '-DPYTHON_EXECUTABLE=/cluster/home/cnphuong/my_environment/bin/python', '-DPYTHON_INCLUDE_DIR=/cluster/software/Python/3.6.6-foss-2018b/include/python3.6m', '-DPYTHON_LIBRARY=/cluster/software/Python/3.6.6-foss-2018b/lib/libpython3.6m.so.1.0', '-DTORCH_BUILD_VERSION=1.6.0a0+32bbf12', '-DUSE_NUMPY=True', '/cluster/home/cnphuong/pytorch']' returned non-zero exit status 1.

I only want to run on multiple CPUs. Are these steps correct?
Thanks,
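
For CPU-only runs, one possible alternative to a source build is the Gloo backend, which is a different backend from MPI and is normally included in the prebuilt packages. The sketch below keeps mpiexec as the launcher and only swaps the backend. It assumes Open MPI, which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` to each process (other MPI implementations use different variable names), and it assumes a single machine, so the `MASTER_ADDR` and `MASTER_PORT` defaults are placeholders. The helper names `rank_and_world_size` and `init_gloo` are hypothetical.

```python
import os


def rank_and_world_size(env=os.environ):
    """Read rank and world size from Open MPI's variables, else torch's own."""
    rank = env.get("OMPI_COMM_WORLD_RANK", env.get("RANK", "0"))
    world = env.get("OMPI_COMM_WORLD_SIZE", env.get("WORLD_SIZE", "1"))
    return int(rank), int(world)


def init_gloo():
    """Initialise the Gloo backend for processes started by mpiexec."""
    import torch.distributed as dist  # imported lazily; needs torch installed

    rank, world_size = rank_and_world_size()
    # Single-machine rendezvous defaults; both values are assumptions.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=rank, world_size=world_size)
    return rank, world_size
```

In test1.py, calling `init_gloo()` in place of `dist.init_process_group(backend='mpi')` should allow the same `mpiexec -n 2 python test1.py` command to be used, provided the installed build includes Gloo.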

pritamdamania87 · April 18, 2020, 12:53am

Can you check if that directory has a CMakeLists.txt file? Usually there should be a CMakeLists.txt file in the top level directory when you clone pytorch.
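
A quick way to run that check is sketched below. It looks for the top-level CMakeLists.txt and also flags empty directories under third_party, since a submodule that was never fetched is left as an empty directory. The function name `check_checkout` is hypothetical, and the assumption that third_party is the place to look for submodules is a simplification.

```python
from pathlib import Path


def check_checkout(repo_dir):
    """Return a list of problems found in a PyTorch source checkout."""
    repo = Path(repo_dir)
    problems = []
    if not (repo / "CMakeLists.txt").is_file():
        problems.append("missing top-level CMakeLists.txt")
    third_party = repo / "third_party"
    if third_party.is_dir():
        # A submodule that was never fetched is left as an empty directory.
        for sub in sorted(third_party.iterdir()):
            if sub.is_dir() and not any(sub.iterdir()):
                problems.append(f"empty submodule directory: {sub.name}")
    return problems
```

For the checkout in this thread, that would be `check_checkout("/cluster/home/cnphuong/pytorch")`; an empty list means neither problem was detected.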

ph0123 · April 18, 2020, 5:48pm

Oh. I did not see CMakeLists.txt. I will try to clone again.
Thanks,