GLOO/NCCL connection issues [build from source]

Hello!!

Summary

My build of PyTorch v1.10.0 from source seems to have issues with the gloo and nccl backends, but works fine with mpi. The error alternates between:

  • Connection refused
  • Connection reset by peer
  • Socket Timeout

This happens even when the port is free and available.
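
As a point of reference, a minimal reachability probe (assuming the same HOST/PORT convention as the test snippet below) makes the failure mode concrete: "Connection refused" means nothing is listening on the rendezvous endpoint at connect time.

import os
import socket

# Hypothetical probe: check whether anything is listening on the rendezvous
# endpoint the workers will try to reach.
addr = str(os.environ.get('HOST', '127.0.0.1'))
port = int(os.environ.get('PORT', 29500))

try:
    with socket.create_connection((addr, port), timeout=5):
        print(f"{addr}:{port} is reachable (something is listening)")
except OSError as exc:
    print(f"{addr}:{port} is not reachable: {exc}")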

Using the pre-built wheel from upstream (torch-1.10.0+cu113-cp39-cp39-linux_x86_64.whl), the issue cannot be reproduced. In other words, it works as expected.

Building GLOO with the same configuration (CUDA and NCCL) and running its test suite: all tests pass.
Running the NVIDIA NCCL examples works fine.

Test snippet

import torch.nn.parallel
import torch.distributed as dist
import os

os.environ['MASTER_ADDR'] = str(os.environ.get('HOST', '127.0.0.1'))
os.environ['MASTER_PORT'] = str(os.environ.get('PORT', 29500))
os.environ['RANK'] = str(os.environ.get('SLURM_LOCALID', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('SLURM_NTASKS', 2))

backend = os.environ.get('BACKEND', 'mpi')
print('Using backend:', backend)

dist.init_process_group(backend)
# Alternative: explicit TCP init instead of the default env:// rendezvous.
# dist.init_process_group(backend, init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
#                         rank=int(os.environ['RANK']), world_size=int(os.environ['WORLD_SIZE']))
my_rank = dist.get_rank()
my_size = dist.get_world_size()

print("my rank = %d  my size = %d" % (my_rank, my_size))

dist.destroy_process_group()

Ran on an exclusive single node (SLURM) with two tasks:

srun python ddp_torch.py
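
A possible way to isolate the rendezvous from the backends themselves is to exercise the c10d TCPStore directly. The sketch below is an untested variant of the script above (same HOST/PORT/SLURM variables, same srun launch); rank 0 hosts the store and the other rank connects to it.

import os
from datetime import timedelta

import torch.distributed as dist

# Same environment-derived settings as ddp_torch.py (single node, two tasks).
host = str(os.environ.get('HOST', '127.0.0.1'))
port = int(os.environ.get('PORT', 29500))
rank = int(os.environ.get('SLURM_LOCALID', 0))
world_size = int(os.environ.get('SLURM_NTASKS', 2))

# Rank 0 hosts the store; the other ranks connect to it.
store = dist.TCPStore(host, port, world_size, rank == 0, timeout=timedelta(seconds=30))
store.set(f"rank{rank}", "ok")
print("rank", rank, "sees rank0:", store.get("rank0"))

If this already fails with the same connection errors, the problem would be in the store/rendezvous path rather than in gloo or nccl proper.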

Details

Configuration

>>> print(torch.__config__.show())
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.4
  - NVCC architecture flags: -gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80
  - CuDNN 8.2  (built against CUDA 11.3)
  - Magma 2.6.1
  - Build settings:
      BLAS_INFO=flexi
      BUILD_TYPE=Release
      CUDA_VERSION=11.4
      CUDNN_VERSION=8.2.0
      CXX_COMPILER=/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/gcccore/9.3.0/bin/c++
      CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
      LAPACK_INFO=flexi
      PERF_WITH_AVX=1
      PERF_WITH_AVX2=1
      PERF_WITH_AVX512=1
      TORCH_VERSION=1.10.0
      USE_CUDA=ON
      USE_CUDNN=ON
      USE_EXCEPTION_PTR=1
      USE_GFLAGS=OFF
      USE_GLOG=OFF
      USE_MKLDNN=ON
      USE_MPI=ON
      USE_NCCL=ON
      USE_NNPACK=ON
      USE_OPENMP=ON

Missing from the above output: NCCL v2.11.4.

Build summary

--   USE_DISTRIBUTED       : ON
--     USE_MPI               : ON
--     USE_GLOO              : ON
--     USE_GLOO_WITH_OPENSSL : OFF
--     USE_TENSORPIPE        : ON

Issues

Test suite

PyTorch test suite with my build from source:

  • distributed/test_nccl.py passes
  • distributed/test_c10d_gloo.py fails with connection errors
  • distributed/test_c10d_nccl.py fails with connection errors

PyTorch test suite with the upstream wheel (torch-1.10.0+cu113-cp39-cp39-linux_x86_64.whl):

  • distributed/test_nccl.py passes
  • distributed/test_c10d_gloo.py passes
  • distributed/test_c10d_nccl.py passes

Manually running the test code

GLOO

(2149) ~ $ export BACKEND=gloo
(2149) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(2149) ~ $ srun python ddp_torch.py
terminate called after throwing an instance of 'std::system_error'
  what():  Connection refused

NCCL

(2149) ~ $ export NCCL_DEBUG=INFO
(2149) ~ $ export BACKEND=nccl
(2149) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(2149) ~ $ srun python ddp_torch.py
Using backend: nccl
my rank = 0  my size = 2
Using backend: nccl
my rank = 1  my size = 2
terminate called after throwing an instance of 'std::system_error'
  what():  Connection reset by peer

Or sometimes:

  what():  Connection refused

Or sometimes:

  what():  Socket Timeout

Or in rare cases, it works as expected.

Expected output

NCCL

(32450) ~ $ export NCCL_DEBUG=INFO
(32450) ~ $ export BACKEND=nccl
(32450) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(32450) ~ $ srun python ddp_torch.py
Using backend: nccl
my rank = 0  my size = 2
Using backend: nccl
my rank = 1  my size = 2

GLOO

(32450) ~ $ export BACKEND=gloo
(32450) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(32450) ~ $ srun python ddp_torch.py
Using backend: gloo
my rank = 0  my size = 2
Using backend: gloo
my rank = 1  my size = 2

Epilog

Any clues or hints on what might be wrong with the build from source?

Next step is to build with debug info and see if TORCH_DISTRIBUTED_DEBUG=DETAIL can help.

Related questions:

  1. When using the NCCL backend with the environment variable NCCL_DEBUG=INFO, no NCCL output is produced. How come? (See the sketch after this list.)
  2. Where can I find the build configuration used by the CI builds (the equivalent of the CMake summary)? I looked in the GitHub Actions logs but did not find it.
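
On question 1, my current guess: NCCL creates its communicators lazily at the first collective, not inside init_process_group, so NCCL_DEBUG=INFO has nothing to print until a collective actually runs. A minimal sketch to force one (assuming one GPU per task on the node):

import os

import torch
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', str(os.environ.get('HOST', '127.0.0.1')))
os.environ.setdefault('MASTER_PORT', str(os.environ.get('PORT', 29500)))
os.environ.setdefault('RANK', str(os.environ.get('SLURM_LOCALID', 0)))
os.environ.setdefault('WORLD_SIZE', str(os.environ.get('SLURM_NTASKS', 2)))

dist.init_process_group('nccl')
rank = dist.get_rank()
torch.cuda.set_device(rank)  # assumes one GPU per task on this single node

# The first collective is what actually creates the NCCL communicators,
# so this is where NCCL_DEBUG=INFO output should start appearing.
t = torch.ones(1, device='cuda')
dist.all_reduce(t)
print('rank', rank, 'all_reduce result:', t.item())

dist.destroy_process_group()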

Thank you very much!

Hi, can you try building against the latest master branch, see if the issues persist, and paste the error logs? A useful PR, Revise the socket implementation of c10d by cbalioglu · Pull Request #68226 · pytorch/pytorch · GitHub, has just landed that significantly improves the implementation and error logging of the c10d store, so the logs should provide a lot more detail in case of errors.

In addition, since this seems like a reproducible bug, could you file an issue over at Issues · pytorch/pytorch · GitHub with reproduction instructions so we can take a deeper look?

Thanks! Building from HEAD to include the PR, I got:

(31286) ~ $ export TORCH_DISTRIBUTED_DEBUG=DETAIL
(31286) ~ $ export PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
(31286) ~ $ echo $PORT
32967
(31286) ~ $ export BACKEND=gloo
(31286) ~ $ srun python ddp_torch.py
[W socket.cpp:634] The server socket on [localhost]:32967 is not yet listening (generic error: 111 - Connection refused).
terminate called after throwing an instance of 'std::system_error'
  what():  Connection reset by peer
Using backend: gloo
my rank = 1  my size = 2

It partially worked, but the output from rank 0 is missing.
With the NCCL backend, no information is printed, only the error.

I’ll create an issue then.
Issue: GLOO/NCCL connection issues [build from source] · Issue #69003 · pytorch/pytorch · GitHub