Error when wrapping DDP on two hosts with SLURM + torchrun

I’m following the PyTorch DDP tutorial and trying to run it on a SLURM cluster. With 1 node and multiple GPUs it works. With 2 nodes and a single GPU each, it fails.

The traceback is always the same:

Traceback (most recent call last):
  File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 45, in <module>
    demo_basic()
  File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 30, in demo_basic
    ddp_model = DDP(model, device_ids=[device_id])
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/nn/parallel/distributed.py", line 799, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:

My repro.py file is:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import os
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ['LOCAL_RANK'])
    # torch.cuda.set_device(local_rank)
    print(f"Start running basic DDP example on rank global={rank}, local={local_rank}.")

    # create model and move it to the GPU assigned to this rank
    device_id = rank % torch.cuda.device_count()
    print(f'attempting to place {rank} on {device_id}')
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ['MASTER_PORT'] = os.environ['NEW_MASTER_PORT']
    # os.environ['MASTER_ADDR'] = os.environ['RDZV_HOST']
    demo_basic()

You might be wondering what NEW_MASTER_PORT is, and fair enough, it does look suspicious. The system I’m on only has certain ports open, and when setting the rendezvous endpoint I noticed that MASTER_PORT doesn’t seem to propagate, so I force it to a port that is open for me. This is done in the repro.sh script below (right after the script I also sketch how I’d check what actually gets exported):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=10G
#SBATCH --nodes=2
#SBATCH --output=./%x-%j.out
#SBATCH --error=./%x-%j.err
#SBATCH --export=ALL
export NCCL_DEBUG=INFO
export TORCH_CPP_LOG_LEVEL=INFO 
export TORCH_DISTRIBUTED_DEBUG=INFO 
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_ENABLE_MONITORING=1

export RDZV_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export RDZV_PORT=59817
export NEW_MASTER_PORT=59882

srun torchrun \
    --nnodes=2 \
    --nproc_per_node=1 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    repro.py
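
As a side note, a quick way to double-check my claim that MASTER_PORT doesn’t propagate would be to print the rendezvous variables each rank actually sees, e.g. with a small helper like this (a debugging sketch only; the helper name and the "unset" placeholder are mine, not part of the repro above):

import os

def dump_rendezvous_env():
    # Print the rendezvous-related variables that torchrun/SLURM hand to this process.
    # "unset" is just a placeholder for anything that is missing.
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK",
                "WORLD_SIZE", "NEW_MASTER_PORT"):
        print(f"env {key}={os.environ.get(key, 'unset')}", flush=True)

Calling dump_rendezvous_env() before dist.init_process_group on both nodes would show whether the forced port is really what each rank ends up with.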

The output from running repro.sh is shown below:

Start running basic DDP example on rank global=1, local=0.
attempting to place 1 on 0
Start running basic DDP example on rank global=0, local=0.
attempting to place 0 on 0
attempting to verify shape
<NODE1>:2841810:2841810 [0] NCCL INFO Bootstrap : Using <eno*.64><0>
<NODE1>:2841810:2841810 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
<NODE1>:2841810:2841810 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
attempting to verify shape
<NODE1>:2841810:2841823 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB <eno*.64><0>
<NODE1>:2841810:2841823 [0] NCCL INFO Using non-device net plugin version 0
<NODE1>:2841810:2841823 [0] NCCL INFO Using network IB
<NODE2>:3285920:3285920 [0] NCCL INFO cudaDriverVersion 12030
<NODE2>:3285920:3285920 [0] NCCL INFO Bootstrap : Using <eno*.65><0>
<NODE2>:3285920:3285920 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
<NODE2>:3285920:3285931 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB <eno*.65><0>
<NODE2>:3285920:3285931 [0] NCCL INFO Using non-device net plugin version 0
<NODE2>:3285920:3285931 [0] NCCL INFO Using network IB
<NODE2>:3285920:3285931 [0] NCCL INFO misc/socket.cc:568 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO misc/socket.cc:619 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO bootstrap.cc:274 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO init.cc:1388 -> 2
<NODE2>:3285920:3285931 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
<NODE2>:3285920:3285920 [0] NCCL INFO group.cc:418 -> 2
<NODE2>:3285920:3285920 [0] NCCL INFO group.cc:95 -> 2
<NODE2>:3285920:3285920 [0] NCCL INFO comm 0x564083fa1000 rank 1 nranks 2 cudaDev 0 busId 1000 - Abort COMPLETE

And the error log is below:

[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:59817.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:59817.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 59817.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59817).
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59817 has accepted a connection from [<NODE1>]:42736.
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59817 on [<NODE1>]:42736.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59817
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59817).
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59817 on [<NODE2>]:42284.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59817
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59817 has accepted a connection from [<NODE2>]:42284.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59882).
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:59882.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:59882.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 59882.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (<NODE1>, 59882).
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59882 has accepted a connection from [<NODE1>]:42430.
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59882 on [<NODE1>]:42430.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59882
[I socket.cpp:299] [c10d - debug] The server socket on [::]:59882 has accepted a connection from [<NODE2>]:46664.
[I socket.cpp:884] [c10d] The client socket has connected to [<NODE1>]:59882 on [<NODE2>]:46664.
[I TCPStore.cpp:342] [c10d - debug] TCP client connected to host <NODE1>:59882
[I ProcessGroupNCCL.cpp:770] [Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: INFO, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: INFO, ID=94835073208368
[I ProcessGroupNCCL.cpp:770] [Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.19.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 0, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: INFO, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 120, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, NCCL_DEBUG: INFO, ID=94066020356688
[rank1]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

Traceback (most recent call last):
  File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 45, in <module>
    demo_basic()
  File "/mnt/project/project/<user>/ddp_slurm_mnist/repro.py", line 30, in demo_basic
    ddp_model = DDP(model, device_ids=[device_id])
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/nn/parallel/distributed.py", line 799, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:

Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#5 c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#6 c10d::ProcessGroupNCCL::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) from ??:0
#7 c10d::ops::(anonymous namespace)::allgather_CUDA(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long) from Ops.cpp:0
#8 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#9 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#10 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
#11 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long) from :0
#12 c10d::ProcessGroup::allgather(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&) from :0
#13 c10d::verify_params_across_processes(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::weak_ptr<c10d::Logger> > const&) from ??:0
#14 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#81}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::sibling, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&)#81}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<std::shared_ptr<c10d::Logger> > const&), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) from init.cpp:0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyCFunction_Call from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 _PyEval_EvalFrameDefault from ??:0
#19 _PyFunction_Vectorcall from ??:0
#20 _PyEval_EvalFrameDefault from ??:0
#21 _PyFunction_Vectorcall from ??:0
#22 _PyObject_FastCallDictTstate from ??:0
#23 PyDict_MergeFromSeq2 from ??:0
#24 _PyObject_MakeTpCall from ??:0
#25 _PyEval_EvalFrameDefault from ??:0
#26 _PyDict_GetItemIdWithError from ??:0
#27 _PyFunction_Vectorcall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 PyList_SetSlice from ??:0
#30 _PyEval_EvalCodeWithName from ??:0
#31 PyEval_EvalCode from ??:0
#32 _PyImport_FixupBuiltin from ??:0
#33 PyAST_CompileObject from ??:0
#34 PyRun_String from ??:0
#35 PyRun_SimpleFileExFlags from ??:0
#36 Py_RunMain from ??:0
#37 Py_BytesMain from ??:0
#38 __libc_start_main from ??:0
#39 _start from ??:0

[rank1]:[I ProcessGroupNCCL.cpp:977] [Rank 1] Destroyed 1communicators on CUDA device 0
[rank0]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

[2024-02-21 11:39:59,133] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3285920) of binary: /mnt/project/project/<user>/ddp_slurm_mnist/.venv/bin/python
Traceback (most recent call last):
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/project/project/<user>/ddp_slurm_mnist/.venv/lib64/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-21_11:39:59
  host      : <NODE2>
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 3285920)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

srun: error: <NODE2>: task 1: Exited with exit code 1

This is run with sbatch repro.sh.

The error appears to come from wrapping the model with DDP(model, device_ids=[device_id]), specifically from the underlying call to _verify_param_shape_across_processes, but for the life of me I can’t figure out why…
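
One way I’m thinking of narrowing this down is to drop DDP entirely and try a bare collective, to see whether any NCCL communication between the two nodes works at all. A minimal sketch (my own, untested, assuming the same torchrun/SLURM launch as above):

import os

import torch
import torch.distributed as dist


def minimal_allreduce():
    # Same launch as repro.py, but no DDP: just one all_reduce across the two ranks.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # with 2 ranks and the default SUM op, both should print 2.0
    print(f"rank {dist.get_rank()}: {x.item()}", flush=True)
    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ["MASTER_PORT"] = os.environ["NEW_MASTER_PORT"]
    minimal_allreduce()

If that all_reduce fails with the same ncclSystemError, the problem would presumably be in the NCCL bootstrap / network setup between the nodes rather than anything specific to the DDP wrapper.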