Multi-node distributed training, DDP constructor hangs

Asciotti53 · March 17, 2022, 6:37pm

Hi all, I am trying to get a basic multi-node training example working. In my case, the DDP constructor hangs; however, the NCCL logs suggest that memory is being allocated on the CUDA side (?).

For the record, I have verified connectivity with telnet and nc on the relevant ports between my two machines.
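For anyone who wants to repeat that check from Python instead of telnet/nc, here is a minimal sketch. The helper name `port_is_reachable` is made up for illustration, and the check only shows that a TCP connection to one given port can be opened; it says nothing about any other ports the training job may open later.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and tries each resolved address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it from the worker node against the master address and port while something is listening there, e.g. `port_is_reachable("blah-0.blah.ml-dev.svc.cluster.local", 12345)`.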

I have looked through the following related forum posts: 89711, which doesn’t seem to have a resolution in it; 123220, which just tells people to look at the official docs, which only have an example for single-node setups; as well as the other GitHub issues: 3. I actually make it past the NCCL init stage, and memory is allocated according to the logs.

Some other things I have tried with no success:

Installing the pytorch-nightly build for my torch + cuda version
Setting NCCL_P2P_DISABLE=1

Below is my code snippet.

import os
import torch.distributed as dist
import datetime
import sys
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, " + f"rank = {dist.get_rank()}, backend={dist.get_backend()}")

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        # two linear layers with a ReLU in between
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net = nn.Linear(10, 5)

    def forward(self, x):
        return self.net(self.relu(self.net1(x)))

model = ToyModel().cuda(device=0)
print('model = ToyModel().cuda(device=0) called')
ddp_model = DistributedDataParallel(model, device_ids=[0], output_device=0)
print('ddp_model = DistributedDataParallel(model, device_ids=[0]) called')

# node 1
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
# node 2
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
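torch.distributed.run hands the rendezvous details to each worker through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). A small sketch for printing them at the top of ddp.py, to confirm each node received what the command line intended; `LaunchInfo` and `read_launch_info` are illustrative names, not part of torch.

```python
import os
from typing import Mapping, NamedTuple


class LaunchInfo(NamedTuple):
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_launch_info(env: Mapping[str, str] = os.environ) -> LaunchInfo:
    """Collect the variables the launcher exports for each worker process."""
    return LaunchInfo(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
```

With the two commands above, node 1 would be expected to report rank 0 and node 2 rank 1, both with local_rank 0 and world_size 2.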

Node 1 output

CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py 
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[ 25420 ] world_size = 2, rank = 0, backend=nccl
model = ToyModel().cuda(device=0) called
blah-0:25420:25420 [0] NCCL INFO Bootstrap : Using eth0:10.106.165.46<0>
blah-0:25420:25420 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
blah-0:25420:25420 [0] NCCL INFO NET/IB : No device found.
blah-0:25420:25420 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.165.46<0>
blah-0:25420:25420 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x7fedf8000b20
blah-0:25420:25498 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7fedfc001200
blah-0:25420:25498 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7fedfc00b440
blah-0:25420:25498 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7fee01800000
blah-0:25420:25498 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7fedfc00c230
blah-0:25420:25498 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7fedfc058240
blah-0:25420:25498 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7fedfc058290
blah-0:25420:25498 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7fedfc0582b0
blah-0:25420:25498 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7fedfc0582d0
blah-0:25420:25498 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7fedfc0582f0
blah-0:25420:25498 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7fedfc058310
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:121 Mem Alloc Size 56 pointer 0x7fedf80066e0
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:122 Mem Alloc Size 56 pointer 0x7fedf8006720

Node 2 output

CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[ 22473 ] world_size = 2, rank = 1, backend=nccl
model = ToyModel().cuda(device=0) called
blah-1:22473:22473 [0] NCCL INFO Bootstrap : Using eth0:10.106.111.185<0>
blah-1:22473:22473 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
blah-1:22473:22473 [0] NCCL INFO NET/IB : No device found.
blah-1:22473:22473 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.111.185<0>
blah-1:22473:22473 [0] NCCL INFO Using network Socket
blah-1:22473:22566 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f326c001200
blah-1:22473:22566 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f326c00b440
blah-1:22473:22566 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f3275800000
blah-1:22473:22566 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f326c00c230
blah-1:22473:22566 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f326c058240
blah-1:22473:22566 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7f326c058290
blah-1:22473:22566 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7f326c0582b0
blah-1:22473:22566 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7f326c0582d0
blah-1:22473:22566 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7f326c0582f0
blah-1:22473:22566 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f326c058310

It looks like NCCL is allocating memory, and I am making it past the "model = ToyModel().cuda(device=0) called" print, so it is getting stuck in the DDP constructor call.
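To compare where each node stopped, the NCCL INFO lines can be reduced to the last source location reported per host and thread. This is only a sketch based on the line format visible in the logs above; the regular expression and the name `last_locations` are assumptions made for illustration.

```python
import re

# Matches lines such as:
# blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x...
NCCL_LINE = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):(?P<tid>\d+) \[\d+\] NCCL INFO "
    r"(?P<src>[\w/.]+\.(?:cc|h):\d+) "
)


def last_locations(lines):
    """Map (host, thread id) to the last file:line seen in that thread's log."""
    last = {}
    for line in lines:
        match = NCCL_LINE.match(line.strip())
        if match:
            last[(match["host"], match["tid"])] = match["src"]
    return last
```

Applied to the two outputs above, the init threads on both nodes (25498 and 22566) end at bootstrap.cc:330, and node 1 additionally has thread 25497 ending at bootstrap.cc:122.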

Any ideas?

Versions

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.7.5 (default, Feb 23 2021, 13:22:40) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.14.238-182.422.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 470.57.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] torch==1.11.0
[conda] Could not collect

ptrblck · March 18, 2022, 7:03am

I guess these messages indicate the failure:

[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).

Asciotti53 · March 18, 2022, 1:27pm

They looked menacing, but they occur even in the basic cases during the init_process_group stage and do not prevent the scripts from finishing. It also appears that at least the first one is for IPv6 addresses (the [::] notation), which shouldn’t really matter if I have IPv4 addresses set up and working.

If those addresses/sockets were failing, I would anticipate NCCL to stall during initialization, but it doesn’t. We successfully see blah-0:25420:25420 [0] NCCL INFO Bootstrap : Using eth0:10.106.165.46<0> and no other mention of failure.

I can try to dig in more on that though. I am trying to manually debug NCCL + openmpi, but it requires manually re-installing everything since the libraries that come with torch don’t seem to be sufficient to run the development tests.
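Independent of NCCL itself, the Python standard library can at least show where the Python side is blocked. A minimal sketch using `faulthandler`; the context manager name `dump_stack_if_stuck` is made up for illustration, and it only prints Python-level stack traces, not what NCCL is doing internally.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stack_if_stuck(seconds: float, file=sys.stderr):
    """Dump all Python thread stacks if the wrapped block runs longer than `seconds`."""
    faulthandler.dump_traceback_later(seconds, exit=False, file=file)
    try:
        yield
    finally:
        # The block finished (or raised) in time: disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()


# Possible usage in ddp.py (not executed here):
# with dump_stack_if_stuck(60):
#     ddp_model = DistributedDataParallel(model, device_ids=[0], output_device=0)
```

If the constructor is still blocked after the timeout, the traceback is written to stderr while the process keeps running, so it can be read next to the NCCL output.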

Asciotti53 · March 29, 2022, 7:54pm

@ptrblck Those warnings are in fact a non-issue according to the behavior I have seen and the source code.

Once again, I can actually establish a TCPStore connection, and those warnings appear to be the underlying utils attempting an IPv6 connection (failing) and then falling back to IPv4 (and succeeding). As I called out, NCCL does establish a connection on IPv4.

Utils code here, and I confirmed ipv6 is disabled in my machine’s kernel.

I am able to successfully run the nccl tests on my GPUs on a single node (github). Since torch doesn’t use MPI by default, running the nccl tests with MPI across nodes does not seem entirely reflective of pytorch’s distributed setup.

Are you aware of any other issues similar to this? Could you possibly rope in someone else who might be able to assist?

Thank you!

Edit: I linked to a fork of pytorch for the utils.c++ code; I will look again at the source code in pytorch itself. It looks like there is a series of calls trying to create the socket, IPv6 first, then IPv4 here, so the link was wrong originally but the logic is the same.
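The same try-IPv6-then-IPv4 pattern can be sketched in plain Python. This is not the c10d implementation, only an illustration of the fallback behavior described; the function name `connect_v6_then_v4` is hypothetical.

```python
import socket


def connect_v6_then_v4(host: str, port: int, timeout: float = 3.0) -> socket.socket:
    """Try an IPv6 connection first, then fall back to IPv4."""
    last_error = None
    for family in (socket.AF_INET6, socket.AF_INET):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror as exc:
            # No address of this family for the host: try the next family.
            last_error = exc
            continue
        for fam, kind, proto, _canonname, addr in infos:
            sock = None
            try:
                # With IPv6 disabled in the kernel, socket creation itself
                # can fail with errno 97 (address family not supported).
                sock = socket.socket(fam, kind, proto)
                sock.settimeout(timeout)
                sock.connect(addr)
                return sock
            except OSError as exc:
                last_error = exc
                if sock is not None:
                    sock.close()
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error
```

A failure on the first family is only reported or remembered; the call as a whole still succeeds as long as the IPv4 attempt connects, which matches the warnings followed by a working connection seen above.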

Asciotti53 · March 30, 2022, 12:53pm

To follow up more fully on those c10d logs: I finally figured out how to get the logging set up, and the output below confirms my suspicions. It fails on IPv6 and then falls back to IPv4, which means the source of my issues is within the DDP constructor and NCCL, not the connectivity.

[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (blah-0.blah.ml-dev.svc.cluster.local, 29500).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to [blah-0.blah.ml-dev.svc.cluster.local]:29500.
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:29500 (errno: 97 - Address family not supported by protocol).
[I socket.cpp:590] [c10d - debug] The client socket will attempt to connect to an IPv4 address of (blah-0.blah.ml-dev.svc.cluster.local, 29500).
[I socket.cpp:649] [c10d - trace] The client socket is attempting to connect to blah-0.blah.ml-dev.svc.cluster.local:29500.
[I socket.cpp:725] [c10d] The client socket has connected to blah-0.blah.ml-dev.svc.cluster.local:29500 on blah-0.blah.ml-dev.svc.cluster.local:36722.
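The exact settings that produced this output are not recorded above. As an assumption, one combination that should enable c10d's info, debug, and trace messages is the pair of environment variables TORCH_CPP_LOG_LEVEL and TORCH_DISTRIBUTED_DEBUG, set before torch is imported. A sketch, with `enable_c10d_logging` as an illustrative helper name:

```python
import os


def enable_c10d_logging(env=os.environ) -> None:
    """Request verbose c10d logging; call this before importing torch.

    Assumption: TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL
    are the relevant knobs; values already set in the shell are kept.
    """
    env.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")
    env.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
```

The same two variables can instead be placed in front of the launch command, next to NCCL_DEBUG=INFO.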

Asciotti53 · May 12, 2022, 1:33am

Bumping, as I’ve isolated the issue.

github.com/pytorch/pytorch

all_gather not working with NCCL Backend

opened 05:41PM - 09 May 22 UTC

Asciotti

oncall: distributed module: nccl module: ddp

### 🐛 Describe the bug

When using NCCL backend, my code stalls on `all_gather…` when using nodes > 1 (aka multi-nodes) regardless of number of GPUs. However, it does not stall when using 1 node but any number of GPUs.

This issue is actually stemming from trying to get `DDP` working; however, the `all_gather` call underneath that initialization led me to here.

** Big note, I cannot get this to run with `Gloo` either but I'm primarily interested in `nccl` **

```
import os
import torch.distributed as dist
import torch.nn as nn
import torch

if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    print("Local rank gpu: ", local_rank)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    tensor_list = [torch.zeros(2, dtype=torch.int64).cuda() for _ in range(2)]
    tensor = torch.arange(2, dtype=torch.int64).cuda() + 1 + 2 * local_rank
    print(tensor)
    dist.all_gather(tensor_list, tensor)
    print(tensor_list)
```

If I call it single node with any of the invocations below, it works.

### Single node, 1 gpu (I change the list to only generate/collect 1 object in this case)

```
NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr="ddp-0.ddp.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
```

### Single node, 2 gpus

```
NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="ddp-0.ddp.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
```

### Single node, 2 gpus, each called from different terminal

```
# terminal 0
NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="ddp-0.ddp.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
```

```
# terminal 1
NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="ddp-0.ddp.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
```

### Multiple node, single gpu on each (so 2 total)

As soon as I move to multiple nodes it stalls in the `all_gather`

```
# node 0
NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="ddp-0.ddp.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
```

```
# node 1
NCCL_SHM_DISABLE=1 CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="ddp-0.ddp.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
```

Logs on node 1:

```
tensor([1, 2], device='cuda:0')
ddp-1:2766:2766 [0] NCCL INFO Bootstrap : Using eth0:10.106.88.167<0>
ddp-1:2766:2766 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ddp-1:2766:2766 [0] NCCL INFO NET/IB : No device found.
ddp-1:2766:2766 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.88.167<0>
ddp-1:2766:2766 [0] NCCL INFO Using network Socket
ddp-1:2766:2798 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f1dd8001200
ddp-1:2766:2798 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f1dd800b440
ddp-1:2766:2798 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f1ddd800000
ddp-1:2766:2798 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f1dd800c230
ddp-1:2766:2798 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f1dd8058240
ddp-1:2766:2798 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7f1dd8058290
ddp-1:2766:2798 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7f1dd80582b0
ddp-1:2766:2798 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7f1dd80582d0
ddp-1:2766:2798 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7f1dd80582f0
ddp-1:2766:2798 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f1dd8058310
... it freezes
```

I looked at https://github.com/pytorch/pytorch/issues/18689 and https://github.com/pytorch/pytorch/issues/75619 and added the solutions to my script.

### Versions

```
Collecting environment information...
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26

Python version: 3.7.5 (default, Feb 23 2021, 13:22:40) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.14.238-182.422.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 470.57.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] torch==1.11.0
[conda] Could not collect
```

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
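As a cross-check on what the repro script should print once `all_gather` completes, here is a pure-Python sketch of the expected result. No torch or NCCL is involved, and `local_tensor` and `simulated_all_gather` are made-up helper names that only mirror the arithmetic in the script.

```python
def local_tensor(local_rank: int) -> list:
    """Mirror torch.arange(2) + 1 + 2 * local_rank from the repro script."""
    return [i + 1 + 2 * local_rank for i in range(2)]


def simulated_all_gather(local_ranks: list) -> list:
    """Return what every rank should end up with: one tensor per rank, in rank order.

    `local_ranks[r]` is the LOCAL_RANK of global rank r.
    """
    return [local_tensor(local_rank) for local_rank in local_ranks]


# Single node, 2 GPUs: local ranks are 0 and 1.
print(simulated_all_gather([0, 1]))
# Two nodes, 1 GPU each: LOCAL_RANK is 0 on both nodes.
print(simulated_all_gather([0, 0]))
```

Because the script offsets by `local_rank` and not by the global rank, both nodes contribute `[1, 2]` in the multi-node case, which is consistent with the `tensor([1, 2], device='cuda:0')` line in the node 1 log; identical tensors there are expected and not a symptom of the hang.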

Divyansh_Shivashok · October 13, 2023, 5:00pm

Hi Asciotti,
I am encountering the same issue with running distributed training on my server. How did you fix this error?