Multi-node distributed training, DDP constructor hangs

Hi all, I am trying to get a basic multi-node training example working. In my case, the DDP constructor hangs, even though the NCCL logs suggest that memory is being allocated on the CUDA side (?).

For the record, I have verified with telnet and nc that the ports I use are reachable between my two machines.
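
For anyone who wants to repeat that kind of check without telnet or nc, here is a minimal sketch in plain Python. The helper name can_connect is hypothetical, and the host and port in the usage note are just the values from my launch command. It only tests the one port you give it; NCCL may open further sockets on ports of its own choosing, which this does not cover.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Usage would be something like can_connect("blah-0.blah.ml-dev.svc.cluster.local", 12345), run from the other node while something is listening on that port.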

I have looked through the following related forum posts: 89711, which doesn’t seem to have a resolution in it, and 123220, which just tells people to look at the official docs, which only have an example for single-node setups; as well as the other github issues: 3. In my case, I actually make it past the NCCL init stage, and memory is allocated according to the logs.

Some other things I have tried with no success:

  • Install pytorch-nightly build for my torch + cuda version
  • NCCL_P2P_DISABLE=1

Below is my code snippet.

import os
import torch.distributed as dist
import datetime
import sys
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")

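class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Assumption: net1 and relu were missing from the original snippet
        # although forward() uses them; the 10 -> 10 size is chosen to fit net.
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net = nn.Linear(10, 5)

    def forward(self, x):
        return self.net(self.relu(self.net1(x)))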

model = ToyModel().cuda(device=0)
print('model = ToyModel().cuda(device=0) called')
ddp_model = DistributedDataParallel(model, device_ids=[0], output_device=0)
print('ddp_model = DistributedDataParallel(model, device_ids=[0]) called')

# node 1
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
# node 2
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py

Node 1 output

CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py 
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[ 25420 ] world_size = 2, rank = 0, backend=nccl
model = ToyModel().cuda(device=0) called
blah-0:25420:25420 [0] NCCL INFO Bootstrap : Using eth0:10.106.165.46<0>
blah-0:25420:25420 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
blah-0:25420:25420 [0] NCCL INFO NET/IB : No device found.
blah-0:25420:25420 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.165.46<0>
blah-0:25420:25420 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x7fedf8000b20
blah-0:25420:25498 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7fedfc001200
blah-0:25420:25498 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7fedfc00b440
blah-0:25420:25498 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7fee01800000
blah-0:25420:25498 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7fedfc00c230
blah-0:25420:25498 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7fedfc058240
blah-0:25420:25498 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7fedfc058290
blah-0:25420:25498 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7fedfc0582b0
blah-0:25420:25498 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7fedfc0582d0
blah-0:25420:25498 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7fedfc0582f0
blah-0:25420:25498 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7fedfc058310
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:121 Mem Alloc Size 56 pointer 0x7fedf80066e0
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:122 Mem Alloc Size 56 pointer 0x7fedf8006720

Node 2 output

CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[ 22473 ] world_size = 2, rank = 1, backend=nccl
model = ToyModel().cuda(device=0) called
blah-1:22473:22473 [0] NCCL INFO Bootstrap : Using eth0:10.106.111.185<0>
blah-1:22473:22473 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
blah-1:22473:22473 [0] NCCL INFO NET/IB : No device found.
blah-1:22473:22473 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.111.185<0>
blah-1:22473:22473 [0] NCCL INFO Using network Socket
blah-1:22473:22566 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f326c001200
blah-1:22473:22566 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f326c00b440
blah-1:22473:22566 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f3275800000
blah-1:22473:22566 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f326c00c230
blah-1:22473:22566 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f326c058240
blah-1:22473:22566 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7f326c058290
blah-1:22473:22566 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7f326c0582b0
blah-1:22473:22566 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7f326c0582d0
blah-1:22473:22566 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7f326c0582f0
blah-1:22473:22566 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f326c058310

It looks like NCCL is allocating memory, and both nodes get past the "model = ToyModel().cuda(device=0) called" print but never reach the next one, i.e. the process is getting stuck in the DDP init call.
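
To see where the two nodes' NCCL logs diverge, the allocation lines can be compared by source location. This is a rough sketch that assumes the log format shown above; alloc_sites and only_in_first are hypothetical helper names, not part of NCCL or PyTorch.

```python
import re

# Matches lines such as:
# "blah-0:25420:25498 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x..."
ALLOC_RE = re.compile(r"NCCL INFO (\S+:\d+) (?:Mem|Cuda Host) Alloc Size (\d+)")


def alloc_sites(log_text):
    """Return the ordered list of (source_location, size) allocation records."""
    return [(m.group(1), int(m.group(2))) for m in ALLOC_RE.finditer(log_text)]


def only_in_first(first_log, second_log):
    """Source locations that appear in first_log but not in second_log."""
    seen = {site for site, _ in alloc_sites(second_log)}
    return [site for site, _ in alloc_sites(first_log) if site not in seen]
```

Applied to the two outputs above, the locations that appear only on node 1 are bootstrap.cc:107, bootstrap.cc:121 and bootstrap.cc:122, all from thread 25497. That alone does not identify the cause of the hang.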

Any ideas?

Versions

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.7.5 (default, Feb 23 2021, 13:22:40) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.14.238-182.422.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 470.57.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] torch==1.11.0
[conda] Could not collect

I guess these messages indicate the failure:

[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).

They look menacing, but they also occur in basic working cases during the init_process_group stage and do not prevent the scripts from finishing. It also appears that at least the first one concerns IPv6 addresses (the [::] notation), which shouldn’t really matter if I have IPv4 addresses set up and working.

If those addresses/sockets were failing, I would anticipate NCCL to stall during initialization, but it doesn’t. We successfully see blah-0:25420:25420 [0] NCCL INFO Bootstrap : Using eth0:10.106.165.46<0> and no other mention of failure.

I can try to dig in more on that though. I am trying to manually debug NCCL + openmpi, but it requires manually re-installing everything since the libraries that come with torch don’t seem to be sufficient to run the development tests.

@ptrblck Those warnings are in fact a non-issue, judging by the behavior I have seen and the source code.

Once again, I can actually establish a TCPStore connection, and those warnings appear to come from the underlying utils attempting an IPv6 connection (which fails) and then falling back to IPv4 (which succeeds). As I called out, NCCL does establish a connection over IPv4.

Utils code here, and I confirmed ipv6 is disabled in my machine’s kernel.
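
A quick way to reproduce that check from Python is to try creating an IPv6 socket and look at the errno. This is a small sketch with a hypothetical helper name, ipv6_socket_status; on Linux, EAFNOSUPPORT is errno 97, the same number that appears in the c10d warnings.

```python
import errno
import socket


def ipv6_socket_status() -> str:
    """Try to create an IPv6 TCP socket and report what happened."""
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError as exc:
        if exc.errno == errno.EAFNOSUPPORT:
            # "Address family not supported by protocol"
            return "unsupported"
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
    sock.close()
    return "supported"
```

On a machine with IPv6 disabled in the kernel I would expect this to return "unsupported", but I have only reasoned about that from the errno in the warnings.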

I am able to successfully run the NCCL tests on my GPUs on a single node (github). Since torch doesn’t use MPI by default, running the NCCL tests with MPI across nodes doesn’t seem entirely reflective of PyTorch’s distributed setup.

Are you aware of any other issues similar to this? Could you possibly rope in someone else who might be able to assist?

Thank you!

Edit: I linked to a fork of pytorch for the utils C++ code; I will look again at the source code in pytorch itself. It looks like there is a series of calls trying to create the socket, IPv6 first, then IPv4 here, so the original link was wrong but the logic is the same.

To follow up more fully on those c10d logs: I finally figured out how to set up the logging, and the output below confirms my suspicions. It fails on IPv6 and then falls back to IPv4, which means the source of my problem is within the DDP constructor and NCCL, not the connectivity.

[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (blah-0.blah.ml-dev.svc.cluster.local, 29500).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to [blah-0.blah.ml-dev.svc.cluster.local]:29500.
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:29500 (errno: 97 - Address family not supported by protocol).
[I socket.cpp:590] [c10d - debug] The client socket will attempt to connect to an IPv4 address of (blah-0.blah.ml-dev.svc.cluster.local, 29500).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to blah-0.blah.ml-dev.svc.cluster.local:29500.
[I socket.cpp:725] [c10d] The client socket has connected to blah-0.blah.ml-dev.svc.cluster.local:29500 on blah-0.blah.ml-dev.svc.cluster.local:36722.
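
The order of attempts in those c10d messages (IPv6 first, then IPv4) can be mimicked in plain Python. This is only an illustrative sketch of the fallback pattern, not the c10d implementation; connect_with_fallback is a hypothetical helper.

```python
import socket


def connect_with_fallback(host, port, timeout=3.0):
    """Try IPv6 first, then IPv4. Return (socket, family_name) or raise the last error."""
    last_error = None
    for family in (socket.AF_INET6, socket.AF_INET):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except OSError as exc:
            # No address of this family for the host: move on to the next family.
            last_error = exc
            continue
        for fam, socktype, proto, _canon, sockaddr in infos:
            try:
                # This is the step that fails with errno 97 when IPv6 is disabled.
                sock = socket.socket(fam, socktype, proto)
            except OSError as exc:
                last_error = exc
                continue
            try:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock, fam.name
            except OSError as exc:
                last_error = exc
                sock.close()
    raise last_error or OSError("no address to try")
```

The caller owns the returned socket and should close it. A failure on the first family is recorded but not treated as fatal, which matches the warning-then-success sequence in the log above.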

Bumping, as I’ve isolated the issue.