I’m following PyTorch DDP tutorial and I have a SLURM cluster that I’m trying to schedule with. When I use 1 node with multiple GPUs I’m able to get it to work. When using 2 nodes with a single GPU each it fails.
The traceback each time is always this:
Traceback (most recent call last):
File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 45, in <module>
demo_basic()
File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 30, in demo_basic
ddp_model = DDP(model, device_ids=[device_id])
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/nn/parallel/distributed.py", line 799, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
My repro.py
file is:
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import os
from torch.nn.parallel import DistributedDataParallel as DDP
class ToyModel(nn.Module):
def __init__(self):
super(ToyModel, self).__init__()
self.net1 = nn.Linear(10, 10)
self.relu = nn.ReLU()
self.net2 = nn.Linear(10, 5)
def forward(self, x):
return self.net2(self.relu(self.net1(x)))
def demo_basic():
dist.init_process_group("nccl")
rank = dist.get_rank()
local_rank = int(os.environ['LOCAL_RANK'])
# torch.cuda.set_device(local_rank)
print(f"Start running basic DDP example on rank global={rank}, local={local_rank}.")
# create model and move it to GPU with id rank
device_id = rank % torch.cuda.device_count()
print(f'attempting to place {rank} on {device_id}')
model = ToyModel().to(device_id)
ddp_model = DDP(model, device_ids=[device_id])
loss_fn = nn.MSELoss()
optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
optimizer.zero_grad()
outputs = ddp_model(torch.randn(20, 10))
labels = torch.randn(20, 5).to(device_id)
loss_fn(outputs, labels).backward()
optimizer.step()
dist.destroy_process_group()
if __name__ == "__main__":
os.environ['MASTER_PORT'] = os.environ['NEW_MASTER_PORT']
# os.environ['MASTER_ADDR'] = os.environ['RDZV_HOST']
demo_basic()
You might be like “what is this NEW_MASTER_PORT?. That’s sus.” The system I’m on only has certain ports open and when setting the RDZV endpoints I noticed that the MASTER_PORT doesn’t seem to propagate, so I force it to something that’s open for me. This is done in the repro.sh
script which is below:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=10G
#SBATCH --nodes=2
#SBATCH --output=./%x-%j.out
#SBATCH --error=./%x-%j.err
#SBATCH --export=ALL
export NCCL_DEBUG=INFO
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_ENABLE_MONITORING=1
export RDZV_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export RDZV_PORT=59817
export NEW_MASTER_PORT=59882
srun torchrun \
--nnodes=2 \
--nproc_per_node=1 \
--rdzv_id=$SLURM_JOB_ID \
--rdzv_backend=c10d \
--rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
repro.py
The output is shown below:
Start running basic DDP example on rank global=1, local=0.
attempting to place 1 on 0
Start running basic DDP example on rank global=0, local=0.
attempting to place 0 on 0
attempting to verify shape
<NODE1>:2841810:2841810 [0] NCCL INFO Bootstrap : Using <eno*.64><0>
<NODE1>:2841810:2841810 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
<NODE1>:2841810:2841810 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
attempting to verify shape
<NODE1>:2841810:2841823 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB <eno*.64><0>
<NODE1>:2841810:2841823 [0] NCCL INFO Using non-device net plugin version 0
<NODE1>:2841810:2841823 [0] NCCL INFO Using network IB
<NODE2>:3285920:3285920 [0] NCCL INFO cudaDriverVersion 12030
<NODE2>:3285920:3285920 [0] NCCL INFO Bootstrap : Using <eno*.65><0>
<NODE2>:3285920:3285920 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
<NODE2>:3285920:3285931 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB <eno*.65><0>
<NODE2>:3285920:3285931 [0] NCCL INFO Using non-device net plugin version 0
<NODE2>:3285920:3285931 [0] NCCL INFO Using network IB
<NODE2>:3285920:3285931 [0] NCCL INFO misc/socket.cc:568 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO misc/socket.cc:619 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO bootstrap.cc:274 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO init.cc:1388 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
<NODE2>:3285920:3285920 [0] NCCL INFO group.cc:418 -> 2
<NODE2>:3285920:3285920 [0] NCCL INFO group.cc:95 -> 2
<NODE2>:3285920:3285920 [0] NCCL INFO comm 0x564083fa1000 rank 1 nranks 2 cudaDev 0 busId 1000 - Abort COMPLETE
And the error log is below:
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:59817.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:59817.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 59817.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59817).
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59817 has accepted a connection from [<NODE1>]:42736.
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59817 on [<NODE1>]:42736.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59817
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59817).
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59817 on [<NODE2>]:42284.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59817
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59817 has accepted a connection from [<NODE2>]:42284.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59882).
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:59882.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:59882.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 59882.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59882).
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59882 has accepted a connection from [<NODE1>]:42430.
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59882 on [<NODE1>]:42430.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59882
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59882 has accepted a connection from [<NODE2>]:46664.
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59882 on [<NODE2>]:46664.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59882
[I ProcessGroupNCCL.cpp:770] [Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: INFO, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: INFO, ID=94835073208368
[I ProcessGroupNCCL.cpp:770] [Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: INFO, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: INFO, ID=94066020356688
[rank1]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
Traceback (most recent call last):
File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 45, in <module>
demo_basic()
File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 30, in demo_basic
ddp_model = DDP(model, device_ids=[device_id])
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/nn/parallel/distributed.py", line 799, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#5 c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#6 c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) from ??:0
#7 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long) from Ops.cpp:0
#8 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#9 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#10 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
#11 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long) from :0
#12 c10d::ProcessGroup::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) from :0
#13 c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::weak_ptr<c10d::Logger> > const&) from ??:0
#14 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#81}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#81}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) from init.cpp:0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyCFunction_Call from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 _PyEval_EvalFrameDefault from ??:0
#19 _PyFunction_Vectorcall from ??:0
#20 _PyEval_EvalFrameDefault from ??:0
#21 _PyFunction_Vectorcall from ??:0
#22 _PyObject_FastCallDictTstate from ??:0
#23 PyDict_MergeFromSeq2 from ??:0
#24 _PyObject_MakeTpCall from ??:0
#25 _PyEval_EvalFrameDefault from ??:0
#26 _PyDict_GetItemIdWithError from ??:0
#27 _PyFunction_Vectorcall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 PyList_SetSlice from ??:0
#30 _PyEval_EvalCodeWithName from ??:0
#31 PyEval_EvalCode from ??:0
#32 _PyImport_FixupBuiltin from ??:0
#33 PyAST_CompileObject from ??:0
#34 PyRun_String from ??:0
#35 PyRun_SimpleFileExFlags from ??:0
#36 Py_RunMain from ??:0
#37 Py_BytesMain from ??:0
#38 __libc_start_main from ??:0
#39 _start from ??:0
[rank1]:[I ProcessGroupNCCL.cpp:977] [Rank 1] Destroyed 1communicators on CUDA device 0
[rank0]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
[2024-02-21 11:39:59,133] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3285920) of binary: /mnt/project/project/<user>/ddp_slurm_mnist/.venv/bin/python
Traceback (most recent call last):
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
repro.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-21_11:39:59
host : <NODE2>
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 3285920)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
srun: error: <NODE2>: task 1: Exited with exit code 1
This is being ran with sbatch repro.sh
.
The error appears to come from me wrapping DDP(model, device_ids=[device_id])
, specifically as part of the underlying call to _verify_param_shape_across_processes
, but for the life of me I can’t figure out why…