Hi all, I am trying to get a basic multi-node DDP training example working. In my case the DistributedDataParallel constructor hangs, even though the NCCL debug logs show memory being allocated, so NCCL itself appears to get at least partway through initialization.
For the record, I have verified connectivity between the two machines on the relevant ports with both telnet and nc.
I have looked through the related forum posts: 89711, which doesn’t seem to reach a resolution, and 123220, which just points people at the official docs, which only cover single-node setups. I also went through the related GitHub issues. Unlike those reports, I actually make it past the NCCL init stage, and memory is allocated according to the logs.
Some other things I have tried, with no success:
- Installing the pytorch-nightly build for my torch + CUDA version
- Setting NCCL_P2P_DISABLE=1
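One knob I have not tried yet: pinning the socket interface explicitly to eth0 (the interface NCCL reports using in the logs below), in case interface auto-selection is part of the problem. Something like:

```shell
# Untried so far: force the bootstrap/data path onto eth0, the interface
# the NCCL logs show being picked up anyway.
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0   # for the c10d/rendezvous side
# then launch with python -m torch.distributed.run as before
```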
Below is my code snippet (ddp.py).
import os

import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend="nccl")
print(f"[ {os.getpid()} ] world_size = {dist.get_world_size()}, "
      f"rank = {dist.get_rank()}, backend={dist.get_backend()}")

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

model = ToyModel().cuda(device=0)
print('model = ToyModel().cuda(device=0) called')
ddp_model = DistributedDataParallel(model, device_ids=[0], output_device=0)
print('ddp_model = DistributedDataParallel(model, device_ids=[0]) called')
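For completeness, what would come after the constructor (never reached here, since it hangs) is just a toy forward/backward pass; a rough sketch, with a train_step helper name of my own, where backward() is the point at which DDP would fire its NCCL all_reduce of gradients:

```python
# Hypothetical continuation: one toy training step for the ToyModel above.
# Never reached in my runs, since DistributedDataParallel() never returns.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, device="cuda:0"):
    inputs = torch.randn(20, 10, device=device)   # batch of 20, in_features=10
    targets = torch.randn(20, 5, device=device)   # out_features=5
    optimizer.zero_grad()
    loss = F.mse_loss(model(inputs), targets)
    loss.backward()   # this is where DDP would all_reduce gradients via NCCL
    optimizer.step()
    return loss.item()
```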
# node 1
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
# node 2
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
Node 1 output
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[ 25420 ] world_size = 2, rank = 0, backend=nccl
model = ToyModel().cuda(device=0) called
blah-0:25420:25420 [0] NCCL INFO Bootstrap : Using eth0:10.106.165.46<0>
blah-0:25420:25420 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
blah-0:25420:25420 [0] NCCL INFO NET/IB : No device found.
blah-0:25420:25420 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.165.46<0>
blah-0:25420:25420 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda10.2
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:107 Mem Alloc Size 28 pointer 0x7fedf8000b20
blah-0:25420:25498 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7fedfc001200
blah-0:25420:25498 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7fedfc00b440
blah-0:25420:25498 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7fee01800000
blah-0:25420:25498 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7fedfc00c230
blah-0:25420:25498 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7fedfc058240
blah-0:25420:25498 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7fedfc058290
blah-0:25420:25498 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7fedfc0582b0
blah-0:25420:25498 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7fedfc0582d0
blah-0:25420:25498 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7fedfc0582f0
blah-0:25420:25498 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7fedfc058310
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:121 Mem Alloc Size 56 pointer 0x7fedf80066e0
blah-0:25420:25497 [0] NCCL INFO bootstrap.cc:122 Mem Alloc Size 56 pointer 0x7fedf8006720
Node 2 output
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node 1 --max_restarts 0 --master_addr="blah-0.blah.ml-dev.svc.cluster.local" --master_port=12345 ddp.py
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [blah-0.blah.ml-dev.svc.cluster.local]:12345 (errno: 97 - Address family not supported by protocol).
[ 22473 ] world_size = 2, rank = 1, backend=nccl
model = ToyModel().cuda(device=0) called
blah-1:22473:22473 [0] NCCL INFO Bootstrap : Using eth0:10.106.111.185<0>
blah-1:22473:22473 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
blah-1:22473:22473 [0] NCCL INFO NET/IB : No device found.
blah-1:22473:22473 [0] NCCL INFO NET/Socket : Using [0]eth0:10.106.111.185<0>
blah-1:22473:22473 [0] NCCL INFO Using network Socket
blah-1:22473:22566 [0] NCCL INFO init.cc:260 Mem Alloc Size 18872 pointer 0x7f326c001200
blah-1:22473:22566 [0] NCCL INFO misc/utils.cc:30 Mem Alloc Size 12 pointer 0x7f326c00b440
blah-1:22473:22566 [0] NCCL INFO init.cc:279 Cuda Host Alloc Size 4 pointer 0x7f3275800000
blah-1:22473:22566 [0] NCCL INFO init.cc:286 Mem Alloc Size 311296 pointer 0x7f326c00c230
blah-1:22473:22566 [0] NCCL INFO include/enqueue.h:50 Mem Alloc Size 24 pointer 0x7f326c058240
blah-1:22473:22566 [0] NCCL INFO init.cc:305 Mem Alloc Size 8 pointer 0x7f326c058290
blah-1:22473:22566 [0] NCCL INFO init.cc:306 Mem Alloc Size 8 pointer 0x7f326c0582b0
blah-1:22473:22566 [0] NCCL INFO init.cc:309 Mem Alloc Size 16 pointer 0x7f326c0582d0
blah-1:22473:22566 [0] NCCL INFO init.cc:310 Mem Alloc Size 16 pointer 0x7f326c0582f0
blah-1:22473:22566 [0] NCCL INFO bootstrap.cc:330 Mem Alloc Size 128 pointer 0x7f326c058310
So NCCL is allocating memory, and both ranks make it past the model = ToyModel().cuda(device=0) print, i.e. the hang is inside the DDP constructor itself. (The errno 97 warnings appear to be failed IPv6 attempts before the IPv4 connection succeeds, and since init_process_group completes and prints the world size, I assume they are benign.)
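To narrow it down further, the next thing I plan to try is a bare all_reduce without DDP, which should separate a broken collective from a broken DDP setup; a minimal sketch (the helper name is mine), assuming the usual torchrun env vars (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are set by the launcher:

```python
# Sanity check bypassing DDP entirely: init the process group and run a
# single all_reduce. Swapping "nccl" for "gloo" would also tell me whether
# the hang is NCCL-specific.
import torch
import torch.distributed as dist

def all_reduce_check(backend="nccl"):
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
    # gloo reduces CPU tensors; nccl needs the tensor on the GPU
    t = torch.ones(1).cuda(0) if backend == "nccl" else torch.ones(1)
    dist.all_reduce(t)  # sums across ranks
    return t

if __name__ == "__main__":
    print(all_reduce_check())
```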
Any ideas?
Versions
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.26
Python version: 3.7.5 (default, Feb 23 2021, 13:22:40) [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.14.238-182.422.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4
Nvidia driver version: 470.57.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] torch==1.11.0
[conda] Could not collect