GLOO/NCCL connection issues [build from source]

Hello!!

Summary

My build of PyTorch v1.10.0 from source seems to have issues with the gloo and nccl backends, but works fine with mpi. The error alternates between:

  • Connection refused
  • Connection reset by peer
  • Socket Timeout

This happens even when the port is free and available.
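
As a point of reference, a minimal reachability probe (assuming the same HOST/PORT convention as the test snippet below) makes the failure mode concrete: "Connection refused" means nothing is listening on the rendezvous endpoint at connect time.

import os
import socket

# Hypothetical probe: check whether anything is listening on the rendezvous
# endpoint the workers will try to reach.
addr = str(os.environ.get('HOST', '127.0.0.1'))
port = int(os.environ.get('PORT', 29500))

try:
    with socket.create_connection((addr, port), timeout=5):
        print(f"{addr}:{port} is reachable (something is listening)")
except OSError as exc:
    print(f"{addr}:{port} is not reachable: {exc}")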

Using the pre-built wheel from upstream (torch-1.10.0+cu113-cp39-cp39-linux_x86_64.whl), the issue cannot be reproduced. In other words, it works as expected.

Building GLOO with the same configuration (CUDA and NCCL) and running its test suite: all tests pass.
Running the NVIDIA NCCL examples works fine.

Test snippet

import torch.nn.parallel
import torch.distributed as dist
import os

os.environ['MASTER_ADDR'] = str(os.environ.get('HOST', '127.0.0.1'))
os.environ['MASTER_PORT'] = str(os.environ.get('PORT', 29500))
os.environ['RANK'] = str(os.environ.get('SLURM_LOCALID', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('SLURM_NTASKS', 2))

backend = os.environ.get('BACKEND', 'mpi')
print('Using backend:', backend)

dist.init_process_group(backend)
# Alternative: explicit TCP init instead of the default env:// rendezvous.
# dist.init_process_group(backend, init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
#                         rank=int(os.environ['RANK']), world_size=int(os.environ['WORLD_SIZE']))
my_rank = dist.get_rank()
my_size = dist.get_world_size()

print("my rank = %d  my size = %d" % (my_rank, my_size))

dist.destroy_process_group()

Ran on an exclusive single node (SLURM) with two tasks:

srun python ddp_torch.py
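
A possible way to isolate the rendezvous from the backends themselves is to exercise the c10d TCPStore directly. The sketch below is an untested variant of the script above (same HOST/PORT/SLURM variables, same srun launch); rank 0 hosts the store and the other rank connects to it.

import os
from datetime import timedelta

import torch.distributed as dist

# Same environment-derived settings as ddp_torch.py (single node, two tasks).
host = str(os.environ.get('HOST', '127.0.0.1'))
port = int(os.environ.get('PORT', 29500))
rank = int(os.environ.get('SLURM_LOCALID', 0))
world_size = int(os.environ.get('SLURM_NTASKS', 2))

# Rank 0 hosts the store; the other ranks connect to it.
store = dist.TCPStore(host, port, world_size, rank == 0, timeout=timedelta(seconds=30))
store.set(f"rank{rank}", "ok")
print("rank", rank, "sees rank0:", store.get("rank0"))

If this already fails with the same connection errors, the problem would be in the store/rendezvous path rather than in gloo or nccl proper.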

Details

Configuration

>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.4
  - NVCC architecture flags: -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80
  - CuDNN 8.2  (built against CUDA 11.3)
  - Magma 2.6.1
  - Build settings:
      BLAS_INFO=flexi
      BUILD_TYPE=Release
      CUDA_VERSION=11.4
      CUDNN_VERSION=8.2.0
      CXX_COMPILER=/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/gcccore/9.3.0/bin/c++
      CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
      LAPACK_INFO=flexi
      PERF_WITH_AVX=1
      PERF_WITH_AVX2=1
      PERF_WITH_AVX512=1
      TORCH_VERSION=1.10.0
      USE_CUDA=ON
      USE_CUDNN=ON
      USE_EXCEPTION_PTR=1
      USE_GFLAGS=OFF
      USE_GLOG=OFF
      USE_MKLDNN=ON
      USE_MPI=ON
      USE_NCCL=ON
      USE_NNPACK=ON
      USE_OPENMP=ON

Missing from the above output: NCCL v2.11.4.

Build summary

--   USE_DISTRIBUTED       : ON
--     USE_MPI               : ON
--     USE_GLOO              : ON
--     USE_GLOO_WITH_OPENSSL : OFF
--     USE_TENSORPIPE        : ON

Issues

Test suite

PyTorch test suite with my build from source:

  • distributed/test_nccl.py passes
  • distributed/test_c10d_gloo.py fails with connection errors
  • distributed/test_c10d_nccl.py fails with connection errors

PyTorch test suite with the upstream wheel (torch-1.10.0+cu113-cp39-cp39-linux_x86_64.whl):

  • distributed/test_nccl.py passes
  • distributed/test_c10d_gloo.py passes
  • distributed/test_c10d_nccl.py passes

Manually running the test code

GLOO

(2149) ~ $ export BACKEND=gloo
(2149) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(2149) ~ $ srun python ddp_torch.py
terminate called after throwing an instance of 'std::system_error'
  what():  Connection refused

NCCL

(2149) ~ $ export NCCL_DEBUG=INFO
(2149) ~ $ export BACKEND=nccl
(2149) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(2149) ~ $ srun python ddp_torch.py
Using backend: nccl
my rank = 0  my size = 2
Using backend: nccl
my rank = 1  my size = 2
terminate called after throwing an instance of 'std::system_error'
  what():  Connection reset by peer

Or sometimes:

  what():  Connection refused

Or sometimes:

  what():  Socket Timeout

Or in rare cases, it works as expected.

Expected output

NCCL

(32450) ~ $ export NCCL_DEBUG=INFO
(32450) ~ $ export BACKEND=nccl
(32450) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(32450) ~ $ srun python ddp_torch.py
Using backend: nccl
my rank = 0  my size = 2
Using backend: nccl
my rank = 1  my size = 2

GLOO

(32450) ~ $ export BACKEND=gloo
(32450) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(32450) ~ $ srun python ddp_torch.py
Using backend: gloo
my rank = 0  my size = 2
Using backend: gloo
my rank = 1  my size = 2

Epilog

Any clues or hints on what might be wrong with the build from source?

Next step is to build with debug info and see if TORCH_DISTRIBUTED_DEBUG=DETAIL can help.

Related questions:

  1. When using the NCCL backend with the environment variable NCCL_DEBUG=INFO, no NCCL output is produced. How come? (See the sketch after this list.)
  2. Where can I find the build configuration used by the CI builds (the equivalent of the CMake summary)? I looked in the GitHub Actions logs but did not find it.
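
On question 1, my current guess: NCCL creates its communicators lazily at the first collective, not inside init_process_group, so NCCL_DEBUG=INFO has nothing to print until a collective actually runs. A minimal sketch to force one (assuming one GPU per task on the node):

import os

import torch
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', str(os.environ.get('HOST', '127.0.0.1')))
os.environ.setdefault('MASTER_PORT', str(os.environ.get('PORT', 29500)))
os.environ.setdefault('RANK', str(os.environ.get('SLURM_LOCALID', 0)))
os.environ.setdefault('WORLD_SIZE', str(os.environ.get('SLURM_NTASKS', 2)))

dist.init_process_group('nccl')
rank = dist.get_rank()
torch.cuda.set_device(rank)  # assumes one GPU per task on this single node

# The first collective is what actually creates the NCCL communicators,
# so this is where NCCL_DEBUG=INFO output should start appearing.
t = torch.ones(1, device='cuda')
dist.all_reduce(t)
print('rank', rank, 'all_reduce result:', t.item())

dist.destroy_process_group()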

Thank you very much!

Hi, can you try building against the latest master branch, see if the issues persist, and paste the error logs? A useful PR, Revise the socket implementation of c10d by cbalioglu · Pull Request #68226 · pytorch/pytorch · GitHub, has just landed that significantly improves the implementation and error logging of the c10d store, so the logs should provide a lot more detail in case of errors.

In addition, since this seems like a reproducible bug, could you file an issue over at Issues · pytorch/pytorch · GitHub with reproduction instructions so we can take a deeper look?

Thanks! Building from HEAD to include the PR, I got:

(31286) ~ $ export TORCH_DISTRIBUTED_DEBUG=DETAIL
(31286) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(31286) ~ $ echo $PORT
32967
(31286) ~ $ export BACKEND=gloo
(31286) ~ $ srun python ddp_torch.py
[W socket.cpp:634] The server socket on [localhost]:32967 is not yet listening (generic error: 111 - Connection refused).
terminate called after throwing an instance of 'std::system_error'
  what():  Connection reset by peer
Using backend: gloo
my rank = 1  my size = 2

It partially worked, but the output from rank 0 is missing.
With the NCCL backend, no information is printed, only the error.

I’ll create an issue then.
Issue: GLOO/NCCL connection issues [build from source] · Issue #69003 · pytorch/pytorch · GitHub