Hello everyone! I tried solving this issue on my own, but after a few days I have to concede… Admittedly, I am no expert when it comes to Linux in general, and this is my first time working in a high-performance computing environment.
Although I have used DDP with NCCL to train my models in the past, a few days ago I noticed that I was getting weird errors (something along the lines of "size could not be broadcast") which I did not get when training a month ago. I have changed my PyTorch and Python versions since then, which is why I wanted to eliminate as many variables as possible and decided to work on a toy example taken from the DDP guide.
The following is the code I use:
```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    print("Initialising DDP")
    ddp_model = DDP(model, device_ids=[device_id])
    print("Initialised DDP")

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    print(f"Running PyTorch {torch.__version__} with CUDA {torch.version.cuda} and NCCL {torch.cuda.nccl.version()}")
    demo_basic()
```
which I run using this command:
```
torchrun --nnodes 2 --nproc_per_node 1 --rdzv-backend c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT path/to/script.py
```
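In case the rendezvous environment is relevant: a quick sanity check I can drop at the top of the script is to print the variables that torchrun and NCCL read on each node (a debugging addition only, not part of the tutorial code):

```python
import os

# Debugging only: variables set by torchrun's c10d rendezvous plus common
# NCCL knobs; "<unset>" simply means the variable is not exported.
for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE",
            "NCCL_DEBUG", "NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE"):
    print(f"{key}={os.environ.get(key, '<unset>')}")
```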
The script works perfectly fine when I run it on two nodes with one V100 GPU each using NCCL, or when using the Gloo backend with A100 GPUs, but I just cannot get it to work using NCCL and two nodes with one A100 each (or more; one is just for debugging purposes). This is the error log I am receiving:
```
Running PyTorch 2.2.2+cu121 with CUDA 12.1 and NCCL (2, 19, 3)
Start running basic DDP example on rank 1.
Initialising DDP
hpc-g4-1:52582:52582 [0] NCCL INFO cudaDriverVersion 12020
hpc-g4-1:52582:52582 [0] NCCL INFO Bootstrap : Using eth0:123.45.67.89<0>
hpc-g4-1:52582:52582 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hpc-g4-1:52582:52592 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth0:123.45.67.89<0>
hpc-g4-1:52582:52592 [0] NCCL INFO Using non-device net plugin version 0
hpc-g4-1:52582:52592 [0] NCCL INFO Using network IB
hpc-g4-1:52582:52592 [0] NCCL INFO comm 0x8aef670 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 60 commId 0xe200dbb906c36c4f - Init START
hpc-g4-1:52582:52592 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
hpc-g4-1:52582:52592 [0] NCCL INFO P2P Chunksize set to 131072
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IB/0
hpc-g4-1:52582:52592 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0
hpc-g4-1:52582:52594 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:520 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:647 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net.cc:677 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO transport/net.cc:304 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO transport.cc:148 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO init.cc:1117 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO init.cc:1396 -> 2
hpc-g4-1:52582:52592 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
hpc-g4-1:52582:52582 [0] NCCL INFO group.cc:418 -> 2
hpc-g4-1:52582:52582 [0] NCCL INFO group.cc:95 -> 2
hpc-g4-1:52582:52594 [0] misc/ibvwrap.cc:190 NCCL WARN Call to ibv_create_cq failed with error Cannot allocate memory
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:520 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net_ib.cc:647 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO transport/net.cc:677 -> 2
hpc-g4-1:52582:52594 [0] NCCL INFO misc/socket.cc:47 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO misc/socket.cc:58 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO misc/socket.cc:773 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO proxy.cc:1374 -> 3
hpc-g4-1:52582:52594 [0] NCCL INFO proxy.cc:1415 -> 3
hpc-g4-1:52582:52594 [0] proxy.cc:1557 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 3
Traceback (most recent call last):
File "/users/felix.schoen/data/projects/Project/./meta/playground/ddp/ddp.py", line 45, in <module>
demo_basic()
File "/users/felix.schoen/data/projects/Project/./meta/playground/ddp/ddp.py", line 29, in demo_basic
ddp_model = DDP(model, device_ids=[device_id])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 798, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Call to ibv_create_cq failed with error Cannot allocate memory
hpc-g4-1:52582:52582 [0] NCCL INFO comm 0x8aef670 rank 1 nranks 2 cudaDev 0 busId 60 - Abort COMPLETE
[2024-04-23 00:11:00,951] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 52582) of binary: /users/felix.schoen/data/projects/Project/venv/bin/python3.11
Traceback (most recent call last):
File "/users/felix.schoen/data/projects/Project/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/users/felix.schoen/data/projects/Project/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./meta/playground/ddp/ddp.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-23_00:11:00
host : hpc-g4-1.domain.com
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 52582)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
`ulimit -l` is already set to unlimited. According to the official documentation, NCCL 2.19.3 (the version PyTorch ships with) doesn’t even support CUDA 12.1, although I find that hard to believe. According to `nvidia-smi`, the currently installed driver version is 535.129.03 with support for CUDA 12.2.
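As far as I understand, `ibv_create_cq` failing with “Cannot allocate memory” is often related to the locked-memory limit or to the IB/RoCE path itself, so two things I can still try are checking the limit from inside the worker process (rather than my login shell) and forcing NCCL to fall back to plain TCP sockets. A minimal sketch (`NCCL_IB_DISABLE` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables; `eth0` is the interface NCCL reported for bootstrap in the log above):

```python
import os
import resource

# Check the locked-memory limit as seen by the worker process itself; an
# "unlimited" login shell does not guarantee the batch job inherits it.
print("RLIMIT_MEMLOCK (soft, hard):", resource.getrlimit(resource.RLIMIT_MEMLOCK))

# Workaround test: skip the IB/RoCE transport entirely so NCCL uses TCP
# sockets over eth0. Must be set before dist.init_process_group("nccl").
os.environ["NCCL_IB_DISABLE"] = "1"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```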
I’d be grateful for pointers on how to solve this, thanks in advance!