DDP training hangs on one rank during backward on H100s

Here is the setup:

  • 1 node with 8 H100 GPUs
  • PyTorch 2.5.0
  • CUDA 11.8
  • PyTorch Lightning 2.4.0
  • NCCL 2.20.5
  • Batch size per GPU = 1

The training hangs at a random step during the backward operation.

At a random iteration, 1 of the 8 ranks never makes it past the backward pass. I confirmed this by logging in the on_before_backward and on_after_backward callbacks.

One thing to note: I can make the hang go away by changing the size of the model, either bigger or smaller. I don’t think the problem comes from the data, since I’m passing the same input over and over for testing purposes.
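The logging I used in the callbacks can be sketched roughly like this. It is a minimal, dependency-free sketch: the class name is made up, reading RANK from the environment is an assumption about the launcher, and in the real setup the class subclasses pytorch_lightning.Callback and is passed to the Trainer.

```python
import os
import time


class BackwardProbe:
    """Duck-typed sketch of a backward-entry/exit logger.

    In a real Lightning run this would subclass pytorch_lightning.Callback;
    it is written without that dependency here so the idea stands alone.
    """

    def on_before_backward(self, trainer=None, pl_module=None, loss=None):
        self._log("entering backward")

    def on_after_backward(self, trainer=None, pl_module=None):
        self._log("finished backward")

    def _log(self, event):
        # torchrun exports RANK; default to 0 for single-process runs
        rank = os.environ.get("RANK", "0")
        print(f"[rank {rank}] {time.strftime('%H:%M:%S')} {event}", flush=True)
```

The rank that prints "entering backward" without a matching "finished backward" is the stuck one.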

I’ve inspected the hanging process by running gdb -p {pid of the hanging process} and found these two relevant threads:

The main thread:

(gdb) bt

#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55a94582fad8) at ../sysdeps/nptl/futex-internal.h:183

#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55a94582fae0, cond=0x55a94582fab0) at pthread_cond_wait.c:508

#2 __pthread_cond_wait (cond=0x55a94582fab0, mutex=0x55a94582fae0) at pthread_cond_wait.c:647

#3 0x00007ff5b7a51e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#4 0x00007ff5f83f111b in torch::autograd::ReadyQueue::pop() ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#5 0x00007ff5f83f5f0d in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#6 0x00007ff5f83f07e7 in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) () from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#7 0x00007ff60837e329 in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so

#8 0x00007ff5f83f3c2d in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#9 0x00007ff60837e28e in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so

#10 0x00007ff60837c3d4 in THPEngine_run_backward(_object*, _object*, _object*) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so

#11 0x00007ff609f9d67a in cfunction_call (func=0x7ff181607830, args=0x80, kwargs=0x0) at Objects/methodobject.c:543

#12 0x00007ff609fdae73 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c5007c0, throwflag=<optimized out>) at Objects/call.c:305

#13 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#14 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff421b8, oparg=<optimized out>, kwnames=0x0)

at ./Include/cpython/abstract.h:114

#15 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c75ae40, throwflag=<optimized out>) at Python/ceval.c:4231

#16 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#17 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff425d8, oparg=<optimized out>, kwnames=0x0)

at ./Include/cpython/abstract.h:114

#18 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c6c5780, throwflag=<optimized out>) at Python/ceval.c:4231

#19 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#20 0x00007ff609f7bd19 in method_vectorcall (method=<optimized out>, args=0x7ff609818088, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#21 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c7098c0, throwflag=<optimized out>) at Objects/call.c:255

#22 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#23 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff18132ac68, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#24 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1838e4780, throwflag=<optimized out>) at Objects/call.c:255

#25 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#26 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff55c4b6e58, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#27 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c709000, throwflag=<optimized out>) at Objects/call.c:255

#28 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#29 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff55c7133d8, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#30 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1811c0e40, throwflag=<optimized out>) at Objects/call.c:255

#31 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)


at ./Include/internal/pycore_ceval.h:46

#32 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c5009a0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#33 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#34 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff18113ca60, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#35 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#36 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff181149120, throwflag=<optimized out>) at Objects/call.c:255

#37 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#38 0x00007ff609f7bd19 in method_vectorcall (method=<optimized out>, args=0x7ff609818088, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#39 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff18113c8b0, throwflag=<optimized out>) at Objects/call.c:255

#40 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#41 0x00007ff60a070cc6 in _PyObject_Call_Prepend (tstate=0x55a7de58e3c0, callable=0x7ff1cb5c5d80, obj=<optimized out>, args=<optimized out>, kwargs=0x0)

at Objects/call.c:142

#42 0x00007ff60a0a5e8a in slot_tp_call (self=0x7ff181575870, args=0x7ff609818070, kwds=0x0) at Objects/typeobject.c:7494

#43 0x00007ff609fff186 in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff449e0, oparg=<optimized out>, kwnames=<optimized out>)

at Objects/call.c:215

#44 0x00007ff609fd4c27 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a686700, throwflag=<optimized out>) at Python/ceval.c:4213

#45 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#46 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff180f3ad58, nargsf=<optimized out>, kwnames=0x7ff55c482480)

at ./Include/cpython/abstract.h:114

#47 0x00007ff609ffec72 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a68e480, throwflag=<optimized out>) at Objects/call.c:267

#48 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#49 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff18107cfd8, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#50 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff181148f40, throwflag=<optimized out>) at Objects/call.c:255

#51 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#52 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff18134e5d8, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#53 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6bf4c0, throwflag=<optimized out>) at Objects/call.c:255

#54 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#55 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff13a6bf490, nargsf=<optimized out>, kwnames=0x7ff1cb63bc70)

at ./Include/cpython/abstract.h:114

#56 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff45aa8, oparg=<optimized out>, kwnames=0x0)

at ./Include/cpython/abstract.h:114

#57 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6bf300, throwflag=<optimized out>) at Python/ceval.c:4231

#58 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#59 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff181677d98, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#60 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c376e40, throwflag=<optimized out>) at Objects/call.c:255

#61 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#62 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a686510, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#63 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#64 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x55a826d37fd0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#65 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#66 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x55a8af244200, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#67 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#68 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff181148040, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#69 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#70 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6ebe20, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#71 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#72 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a629e00, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#73 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#74 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6298c0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#75 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#76 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff19f75ed40, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#77 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#78 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff19f75ecf8, nargsf=<optimized out>, kwnames=0x7ff1cb693130)

at ./Include/cpython/abstract.h:114

#79 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff47fd8, oparg=<optimized out>, kwnames=0x0)

at ./Include/cpython/abstract.h:114

#80 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff19f75eb60, throwflag=<optimized out>) at Python/ceval.c:4231

#81 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#82 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff181683578, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114

#83 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1812fea40, throwflag=<optimized out>) at Objects/call.c:255

#84 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#85 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff1814693f8, nargsf=<optimized out>, kwnames=0x7ff183833310)

at ./Include/cpython/abstract.h:114

#86 0x00007ff609ffec72 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff183967440, throwflag=<optimized out>) at Objects/call.c:267

#87 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#88 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1837c2cd0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114

#89 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#90 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x55a7ece04930, nargsf=<optimized out>, kwnames=0x7ff60951c9c0)

at ./Include/cpython/abstract.h:114

#91 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff49028, oparg=<optimized out>, kwnames=0x0)

at ./Include/cpython/abstract.h:114

#92 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x55a7ece04660, throwflag=<optimized out>) at Python/ceval.c:4231

#93 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)

at ./Include/internal/pycore_ceval.h:46

#94 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff49450, oparg=<optimized out>, kwnames=0x0)

at ./Include/cpython/abstract.h:114

#95 0x00007ff609fd4c27 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff6097f4440, throwflag=<optimized out>) at Python/ceval.c:4213

#96 0x00007ff60a0d73d3 in _PyEval_EvalFrame (tstate=0x55a7de58e3c0, f=0x7ff6097f4440, throwflag=0) at ./Include/internal/pycore_ceval.h:46

#97 _PyEval_Vector (tstate=0x55a7de58e3c0, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>)

at Python/ceval.c:5067

#98 0x00007ff60a10de72 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ff6097534c0, locals=<optimized out>, flags=<optimized out>,

arena=<optimized out>) at Python/ceval.c:1134

#99 0x00007ff609eb2f0f in pyrun_file (fp=0x55a7de592fb0, filename=0x7ff6096361f0, start=<optimized out>, globals=0x7ff6097534c0, locals=0x7ff6097534c0, closeit=1,

flags=<optimized out>) at Python/pythonrun.c:1208

#100 0x00007ff609eb2a30 in _PyRun_SimpleFileObject (fp=0x55a7de592fb0, filename=0x7ff6096361f0, closeit=1, flags=0x7ffe6ff496d0) at Python/pythonrun.c:456

#101 0x00007ff609eb23fd in _PyRun_AnyFileObject (fp=0x55a7de592fb0, filename=0x7ff6096361f0, closeit=1, flags=0x7ffe6ff496d0) at Python/pythonrun.c:90

#102 0x00007ff609ebbfa9 in pymain_run_file_obj (program_name=0x7ff6096368d0, filename=0x7ff6096361f0, skip_source_first_line=0) at Modules/main.c:353

#103 0x00007ff609ebbb12 in pymain_run_file (config=0x55a7de571bc0) at Modules/main.c:372

#104 0x00007ff60a11a3f3 in Py_RunMain () at Modules/main.c:587

#105 0x00007ff609ebc095 in pymain_main (args=<optimized out>) at Modules/main.c:696

#106 0x00007ff609ebc347 in Py_BytesMain (argc=<optimized out>, argv=0x80) at Modules/main.c:720

#107 0x00007ff60996d083 in __libc_start_main (main=0x55a7dcd49060 <main>, argc=12, argv=0x7ffe6ff49958, init=<optimized out>, fini=<optimized out>,

rtld_fini=<optimized out>, stack_end=0x7ffe6ff49948) at ../csu/libc-start.c:308

#108 0x000055a7dcd4908e in _start ()

When I check the pt_autograd_{rank} threads, and more specifically the one belonging to the hanging rank, I find:

(gdb) thread 146

(gdb) bt

#0 0x00007ff609a4b71b in sched_yield () at ../sysdeps/unix/syscall-template.S:78

#1 0x00007ff5638c67fc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#2 0x00007ff563c31b05 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#3 0x00007ff563b677c4 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#4 0x00007ff563b683c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#5 0x00007ff563c2c1e0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#6 0x00007ff563a518e2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#7 0x00007ff563a52ee4 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#8 0x00007ff563a9c3c2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#9 0x00007ff56397b650 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#10 0x00007ff563c293ae in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#11 0x00007ff563889746 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#12 0x00007ff563889c60 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#13 0x00007ff56388ad77 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#14 0x00007ff563a341a1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1

#15 0x00007ff608f43441 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0

#16 0x00007ff608f166fd in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0

#17 0x00007ff608f686a5 in cudaMemcpyAsync () from /usr/local/cuda/lib64/libcudart.so.11.0

#18 0x00007ff5b990a86b in at::native::copy_device_to_device(at::TensorIterator&, bool, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cuda.so

#19 0x00007ff5b990cf22 in at::native::copy_kernel_cuda(at::TensorIterator&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cuda.so

#20 0x00007ff5f4641a1e in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#21 0x00007ff5f464331a in at::native::copy_(at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#22 0x00007ff5f548751f in at::_ops::copy_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#23 0x00007ff5f8dc2935 in torch::ADInplaceOrView::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#24 0x00007ff5f548751f in at::_ops::copy_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#25 0x00007ff5f8dc3d20 in torch::autograd::VariableType::(anonymous namespace)::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#26 0x00007ff5f54f970f in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#27 0x00007ff5f8ebe680 in std::_Function_handler<bool (at::Tensor&), c10d::Reducer::copy_bucket_to_grad(at::Tensor&, c10d::Reducer::Bucket&, unsigned long, bool)::{lambda(auto:1&)#1}>::_M_invoke(std::_Any_data const&, at::Tensor&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#28 0x00007ff5f8ebd9b3 in c10d::Reducer::copy_bucket_to_grad(at::Tensor&, c10d::Reducer::Bucket&, unsigned long, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#29 0x00007ff5f8ecdb80 in c10d::Reducer::finalize_bucket_dense(c10d::Reducer::Bucket&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#30 0x00007ff5f8ece254 in c10d::Reducer::finalize_backward() ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#31 0x00007ff5f8ecec69 in std::_Function_handler<void (), c10d::Reducer::mark_variable_ready(unsigned long)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#32 0x00007ff5f83eefd0 in torch::autograd::GraphTask::exec_post_processing() ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#33 0x00007ff5f83f14ff in torch::autograd::GraphTask::mark_as_completed_and_run_post_processing() ()


from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#34 0x00007ff5f83f60ba in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#35 0x00007ff5f83efe94 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so

#36 0x00007ff60837de70 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()

from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so

#37 0x00007ff5b7a57df4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#38 0x00007ff609ca9609 in start_thread (arg=<optimized out>) at pthread_create.c:477

#39 0x00007ff609a68353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The cudaMemcpyAsync call hangs for some reason. Eventually this rank can’t join the next collective, and an NCCL timeout is raised.
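To get more signal before the timeout fires, I also turned up the distributed debugging knobs. A sketch of what I set (these are documented PyTorch/NCCL environment variables, but treat the exact combination as a suggestion; they must be set before the process group is created, ideally before launch):

```python
import os

# All of these must be in the environment before torch.distributed /
# Lightning initialize NCCL.
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL's own logging
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"     # focus on init + collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d consistency checks
# Make the stuck collective fail at its call site instead of hanging silently
# (adds synchronization overhead; debugging only):
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"
```

With blocking wait enabled, the rank that misses a collective raises at the offending call rather than leaving every other rank waiting.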

Unfortunately, I can’t give you a reproducible example.

Do you have any idea of what can cause this operation to hang and how to debug it further?
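One lighter-weight probe I can share, for anyone who can’t easily attach gdb: the stdlib faulthandler module can dump the Python-level stacks of all threads on a timer, which is often enough to see which rank is stuck and where (a sketch; the 300-second interval is an arbitrary choice):

```python
import faulthandler
import sys

# Periodically dump every Python thread's stack to stderr. If training is
# healthy the dumps are just noise; if a rank hangs, its last dump shows
# where it stopped.
faulthandler.dump_traceback_later(timeout=300, repeat=True, file=sys.stderr)

# ... training loop runs here ...

faulthandler.cancel_dump_traceback_later()  # stop the timer once the run ends
```

This only shows Python frames, so a hang inside libcuda (as in the backtrace above) still needs gdb, but it cheaply identifies the stuck rank first.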

This isn’t very helpful, but I’m having similar issues, except with A100 GPUs. I have 4 of them, and I’m training a small language model. The run goes through the eval function first (I evaluate at the beginning to get baseline metrics, and then again after every epoch), then the first epoch, where each device processes only 1 batch because I put a break statement in just to speed up these tests. After that it goes back to the eval function, and this works fine. Then in the second epoch, all devices do the forward pass and the loss calculation, but only one of them does the backward pass before the whole program hangs…

Can’t figure out what it is :confused:
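One thing worth ruling out in a setup like this, where some ranks reach the backward pass and others don’t, is ranks executing different numbers of steps (e.g. a break statement or dataloader ending earlier on one rank), since DDP hangs exactly when one rank skips a collective the others are waiting on. A quick check is a per-rank counter logged right before each loss.backward() (a hypothetical, stdlib-only sketch):

```python
import os


class StepCounter:
    """Per-rank backward-call counter. If the printed counts diverge across
    ranks, one rank is skipping a collective the others are waiting on."""

    def __init__(self):
        # torchrun exports RANK; default to 0 for single-process runs
        self.rank = int(os.environ.get("RANK", "0"))
        self.steps = 0

    def tick(self):
        # Call once immediately before each loss.backward()
        self.steps += 1
        print(f"[rank {self.rank}] backward call #{self.steps}", flush=True)
```

Comparing the last printed count on each rank at the moment of the hang shows whether the ranks fell out of step.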