I am trying to run PyTorch in a distributed setting.
I run test1.py, shown below:
import torch
import torch.distributed as dist

def main(rank, world):
    if rank == 0:
        x = torch.tensor([1., -1.])  # Tensor of interest
        dist.send(x, dst=1)
        print('Rank-0 has sent the following tensor to Rank-1')
        print(x)
    else:
        z = torch.tensor([0., 0.])  # A holder for receiving the tensor
        dist.recv(z, src=0)
        print('Rank-1 has received the following tensor from Rank-0')
        print(z)

if __name__ == '__main__':
    dist.init_process_group(backend='mpi')
    main(dist.get_rank(), dist.get_world_size())
Then I run it on a single machine:
mpiexec -n 2 python test1.py
This fails with the following error:
Traceback (most recent call last):
File "test1.py", line 17, in <module>
dist.init_process_group(backend='mpi')
File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 392, in init_process_group
timeout=timeout)
File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 452, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have MPI built in")
RuntimeError: Distributed package doesn't have MPI built in
Traceback (most recent call last):
File "test1.py", line 17, in <module>
dist.init_process_group(backend='mpi')
File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 392, in init_process_group
timeout=timeout)
File "/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 452, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have MPI built in")
RuntimeError: Distributed package doesn't have MPI built in
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[31741,1],1]
Exit code: 1
--------------------------------------------------------------------------
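The RuntimeError above suggests that the installed PyTorch build was compiled without MPI support. A minimal sketch for confirming which backends a given build reports is shown here. The helper name report_backends is made up for illustration, and the lookup of is_gloo_available through getattr is a precaution on the assumption that some older releases may not expose it.

```python
def report_backends():
    """Return a dict describing which torch.distributed backends this build reports."""
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in this interpreter at all.
        return {"torch_installed": False}

    report = {"torch_installed": True, "distributed": dist.is_available()}
    if report["distributed"]:
        report["mpi"] = dist.is_mpi_available()
        report["nccl"] = dist.is_nccl_available()
        # Assumption: older releases may lack is_gloo_available, hence getattr.
        gloo_check = getattr(dist, "is_gloo_available", None)
        report["gloo"] = gloo_check() if gloo_check else None
    return report


if __name__ == "__main__":
    print(report_backends())
```

If the "mpi" entry is False, init_process_group(backend='mpi') is expected to raise the error shown above in that environment.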
I then tried to build PyTorch from source by running the setup and install steps. They failed with the errors below.
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install
Building wheel torch-1.6.0a0+32bbf12
-- Building version 1.6.0a0+32bbf12
cmake -GNinja -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/cluster/home/cnphuong/pytorch/torch -DCMAKE_PREFIX_PATH=/cluster/software/Anaconda3/2019.03/bin/../ -DNUMPY_INCLUDE_DIR=/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/cluster/home/cnphuong/my_environment/bin/python -DPYTHON_INCLUDE_DIR=/cluster/software/Python/3.6.6-foss-2018b/include/python3.6m -DPYTHON_LIBRARY=/cluster/software/Python/3.6.6-foss-2018b/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.6.0a0+32bbf12 -DUSE_NUMPY=True /cluster/home/cnphuong/pytorch
CMake Error: The source directory "/cluster/home/cnphuong/pytorch" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
Traceback (most recent call last):
File "setup.py", line 738, in <module>
build_deps()
File "setup.py", line 320, in build_deps
cmake=cmake)
File "/cluster/home/cnphuong/pytorch/tools/build_pytorch_libs.py", line 59, in build_caffe2
rerun_cmake)
File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 324, in generate
self.run(args, env=my_env)
File "/cluster/home/cnphuong/pytorch/tools/setup_helpers/cmake.py", line 141, in run
check_call(command, cwd=self.build_dir, env=env)
File "/cluster/software/Python/3.6.6-foss-2018b/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '-GNinja', '-DBUILD_PYTHON=True', '-DBUILD_TEST=True', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/cluster/home/cnphuong/pytorch/torch', '-DCMAKE_PREFIX_PATH=/cluster/software/Anaconda3/2019.03/bin/../', '-DNUMPY_INCLUDE_DIR=/cluster/home/cnphuong/my_environment/lib/python3.6/site-packages/numpy/core/include', '-DPYTHON_EXECUTABLE=/cluster/home/cnphuong/my_environment/bin/python', '-DPYTHON_INCLUDE_DIR=/cluster/software/Python/3.6.6-foss-2018b/include/python3.6m', '-DPYTHON_LIBRARY=/cluster/software/Python/3.6.6-foss-2018b/lib/libpython3.6m.so.1.0', '-DTORCH_BUILD_VERSION=1.6.0a0+32bbf12', '-DUSE_NUMPY=True', '/cluster/home/cnphuong/pytorch']' returned non-zero exit status 1.
I only want to run on multiple CPUs. Are these steps correct?
Thanks,
Can you check whether that directory contains a CMakeLists.txt file? There should normally be a CMakeLists.txt file in the top-level directory when you clone the PyTorch repository.
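A minimal sketch of that check in Python, in case it is useful. The helper name has_cmakelists is made up for illustration, and the path is the source directory named in the CMake error message.

```python
from pathlib import Path


def has_cmakelists(source_dir):
    """Return True if source_dir directly contains a CMakeLists.txt file."""
    return (Path(source_dir) / "CMakeLists.txt").is_file()


if __name__ == "__main__":
    # Source directory taken from the CMake error message.
    print(has_cmakelists("/cluster/home/cnphuong/pytorch"))
```

If this prints False, the checkout in that directory is probably incomplete, which would be consistent with the CMake error.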