Hello!!
Summary
My build of PyTorch v1.10.0 from source seems to have issues with the gloo and nccl backends, but works fine with mpi. The errors alternate between:
- Connection refused
- Connection reset by peer
- Socket Timeout
This happens even when the port is free and available.
Using the pre-built wheel from upstream (torch-1.10.0+cu113-cp39-cp39-linux_x86_64.whl), the issue cannot be reproduced; everything works as expected.
Building GLOO with the same configuration (CUDA and NCCL) and running its test suite: all tests pass.
Running the NVIDIA NCCL examples works fine.
Test snippet
import torch.nn.parallel
import torch.distributed as dist
import os
os.environ['MASTER_ADDR'] = str(os.environ.get('HOST', '127.0.0.1'))
os.environ['MASTER_PORT'] = str(os.environ.get('PORT', 29500))
os.environ['RANK'] = str(os.environ.get('SLURM_LOCALID', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('SLURM_NTASKS', 2))
backend = os.environ.get('BACKEND', 'mpi')
print('Using backend:', backend)
dist.init_process_group(backend)
# Equivalent explicit TCP init (the env:// default above reads the same variables):
# dist.init_process_group(backend, init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}", rank=int(os.environ['RANK']), world_size=int(os.environ['WORLD_SIZE']))
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))
dist.destroy_process_group()
Ran on an exclusive single node (SLURM) with two tasks:
srun python ddp_torch.py
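To rule out basic networking before blaming the build, the listen/connect pattern that the c10d TCP rendezvous relies on can be exercised with the standard library alone. This is a minimal sketch, not the real rendezvous protocol; it assumes the same MASTER_ADDR/MASTER_PORT variables and SLURM_LOCALID for the rank:

```python
import os
import socket
import time


def master_accept_one(port, timeout=60.0):
    """Rank 0 side: listen on `port` and accept a single peer."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.settimeout(timeout)
    srv.bind(('', port))
    srv.listen()
    conn, peer = srv.accept()
    conn.close()
    srv.close()
    return peer


def peer_connect(addr, port, retries=50, delay=0.2):
    """Non-zero rank side: connect, retrying while rank 0 comes up."""
    for _ in range(retries):
        try:
            s = socket.create_connection((addr, port), timeout=2)
            s.close()
            return True
        except OSError:
            time.sleep(delay)
    return False


if __name__ == '__main__' and 'SLURM_LOCALID' in os.environ:
    rank = int(os.environ['SLURM_LOCALID'])
    addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
    port = int(os.environ.get('MASTER_PORT', 29500))
    if rank == 0:
        print('rank 0: accepted peer', master_accept_one(port))
    else:
        print(f'rank {rank}: connected={peer_connect(addr, port)}')
```

If this also fails intermittently under `srun`, the problem is in the cluster networking rather than the PyTorch build.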
Details
Configuration
>>> print(torch.__config__.show())
PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.4
- NVCC architecture flags: -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80
- CuDNN 8.2 (built against CUDA 11.3)
- Magma 2.6.1
- Build settings:
BLAS_INFO=flexi
BUILD_TYPE=Release
CUDA_VERSION=11.4
CUDNN_VERSION=8.2.0
CXX_COMPILER=/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/gcccore/9.3.0/bin/c++
CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
LAPACK_INFO=flexi
PERF_WITH_AVX=1
PERF_WITH_AVX2=1
PERF_WITH_AVX512=1
TORCH_VERSION=1.10.0
USE_CUDA=ON
USE_CUDNN=ON
USE_EXCEPTION_PTR=1
USE_GFLAGS=OFF
USE_GLOG=OFF
USE_MKLDNN=ON
USE_MPI=ON,
USE_NCCL=ON,
USE_NNPACK=ON,
USE_OPENMP=ON,
Missing info from the above: NCCL v2.11.4
Build summary
-- USE_DISTRIBUTED : ON
-- USE_MPI : ON
-- USE_GLOO : ON
-- USE_GLOO_WITH_OPENSSL : OFF
-- USE_TENSORPIPE : ON
Issues
Test suite
PyTorch test suite with my build from source:
- distributed/test_nccl.py passes
- distributed/test_c10d_gloo.py fails with connection errors
- distributed/test_c10d_nccl.py fails with connection errors
PyTorch test suite with the upstream wheel (torch-1.10.0+cu113-cp39-cp39-linux_x86_64.whl):
- distributed/test_nccl.py passes
- distributed/test_c10d_gloo.py passes
- distributed/test_c10d_nccl.py passes
Manually running the test code
GLOO
(2149) ~ $ export BACKEND=gloo
(2149) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(2149) ~ $ srun python ddp_torch.py
terminate called after throwing an instance of 'std::system_error'
what(): Connection refused
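"Connection refused" generally means nothing is listening at MASTER_ADDR:MASTER_PORT at the moment the non-zero rank connects. A throwaway stdlib probe, run from the failing node while the job hangs, can confirm this (a sketch assuming the same environment variables):

```python
import os
import socket

addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
port = int(os.environ.get('MASTER_PORT', 29500))


def probe(addr, port, timeout=2.0):
    """Return True if something accepts TCP connections at addr:port."""
    try:
        s = socket.create_connection((addr, port), timeout=timeout)
        s.close()
        return True
    except OSError as e:
        print(f'probe of {addr}:{port} failed: {e}')
        return False


if __name__ == '__main__':
    print('listener reachable:', probe(addr, port))
```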
NCCL
(2149) ~ $ export NCCL_DEBUG=INFO
(2149) ~ $ export BACKEND=nccl
(2149) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(2149) ~ $ srun python ddp_torch.py
Using backend: nccl
my rank = 0 my size = 2
Using backend: nccl
my rank = 1 my size = 2
terminate called after throwing an instance of 'std::system_error'
what(): Connection reset by peer
Or sometimes:
what(): Connection refused
Or sometimes:
what(): Socket Timeout
Or in rare cases, it works as expected.
Expected output
NCCL
(32450) ~ $ export NCCL_DEBUG=INFO
(32450) ~ $ export BACKEND=nccl
(32450) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(32450) ~ $ srun python ddp_torch.py
Using backend: nccl
my rank = 0 my size = 2
Using backend: nccl
my rank = 1 my size = 2
GLOO
(32450) ~ $ export BACKEND=gloo
(32450) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(32450) ~ $ srun python ddp_torch.py
Using backend: gloo
my rank = 0 my size = 2
Using backend: gloo
my rank = 1 my size = 2
Epilog
Any clues or hints on what might be the issue with the build from source?
Next step is to build with debug symbols and see if TORCH_DISTRIBUTED_DEBUG=DETAIL can help.
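The variables I plan to set for the debug run (the logging variable names are per the PyTorch 1.10 docs; the interface name `eth0` is a placeholder for whatever `ip link` reports on the node):

```shell
# Verbose c10d logging (PyTorch >= 1.10)
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO

# Pin NCCL and Gloo to an explicit interface on multi-homed nodes;
# eth0 is a placeholder -- substitute an interface from `ip link`.
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
```

Then rerun `srun python ddp_torch.py` with these in the job environment.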
Related questions:
- When using the NCCL backend with NCCL_DEBUG=INFO set, no NCCL output is produced. How come?
- Where can I find the build configuration used by the CI builds (the equivalent of the CMake summary)? I looked in the GitHub Actions logs, but did not find it.
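On the missing NCCL_DEBUG output, I suspect NCCL only prints once a communicator is actually created, which happens on the first collective rather than at init_process_group, and my snippet never issues one. The other thing worth ruling out is whether the variables reach the srun tasks at all; a stdlib-only check (hypothetical helper, just prints what each task sees):

```python
import os


def show_dist_env():
    """Collect the distributed-related variables visible to this task."""
    names = ('MASTER_ADDR', 'MASTER_PORT', 'RANK', 'WORLD_SIZE',
             'BACKEND', 'NCCL_DEBUG', 'SLURM_LOCALID', 'SLURM_NTASKS')
    return {n: os.environ.get(n, '<unset>') for n in names}


if __name__ == '__main__':
    for name, value in show_dist_env().items():
        print(f'{name}={value}')
```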
Thank you very much!