Here is the setup:
- 1 node containing 8 H100 GPUs.
- PyTorch 2.5.0
- CUDA 11.8
- PyTorch Lightning 2.4.0
- NCCL 2.20.5
- Batch size per GPU = 1
The training hangs at a random step during the backward pass.
At a random iteration, 1 out of the 8 ranks never makes it through the backward. I confirmed this by logging in the on_before_backward and on_after_backward callbacks.
One thing to note: I can make the hang go away by changing the size of the model, either bigger or smaller. I don't think it comes from the data, as I'm passing the same input over and over for testing purposes.
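For context, here is roughly how I confirmed which rank hangs. This is a minimal stand-alone sketch, not the actual code: in the real run the class subclasses Lightning's Callback and the hooks receive the trainer/pl_module arguments; the names below (backward_log_line, BackwardLogger) are just for illustration.

```python
# Minimal sketch of the per-rank backward logging (illustrative names;
# the real version lives in a lightning.pytorch.Callback subclass).
def backward_log_line(phase, rank, step):
    # phase is "before" or "after". The hanging rank prints the "before"
    # line for some step but never the matching "after" line.
    return f"[rank {rank}] {phase}_backward step={step}"

class BackwardLogger:
    """Hook names mirror Lightning's on_before_backward / on_after_backward."""

    def __init__(self, rank):
        self.rank = rank
        self.step = 0

    def on_before_backward(self):
        print(backward_log_line("before", self.rank, self.step))

    def on_after_backward(self):
        print(backward_log_line("after", self.rank, self.step))
        self.step += 1
```

On a healthy step every rank prints a before/after pair; at the hang, exactly one rank's "after" line is missing.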
I inspected the hanging process by running gdb -p {pid of the hanging process} and found these two relevant threads:
The main thread:
(gdb) bt
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55a94582fad8) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55a94582fae0, cond=0x55a94582fab0) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x55a94582fab0, mutex=0x55a94582fae0) at pthread_cond_wait.c:647
#3 0x00007ff5b7a51e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ff5f83f111b in torch::autograd::ReadyQueue::pop() ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#5 0x00007ff5f83f5f0d in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#6 0x00007ff5f83f07e7 in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) () from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#7 0x00007ff60837e329 in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so
#8 0x00007ff5f83f3c2d in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#9 0x00007ff60837e28e in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so
#10 0x00007ff60837c3d4 in THPEngine_run_backward(_object*, _object*, _object*) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so
#11 0x00007ff609f9d67a in cfunction_call (func=0x7ff181607830, args=0x80, kwargs=0x0) at Objects/methodobject.c:543
#12 0x00007ff609fdae73 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c5007c0, throwflag=<optimized out>) at Objects/call.c:305
#13 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#14 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff421b8, oparg=<optimized out>, kwnames=0x0)
at ./Include/cpython/abstract.h:114
#15 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c75ae40, throwflag=<optimized out>) at Python/ceval.c:4231
#16 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#17 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff425d8, oparg=<optimized out>, kwnames=0x0)
at ./Include/cpython/abstract.h:114
#18 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c6c5780, throwflag=<optimized out>) at Python/ceval.c:4231
#19 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#20 0x00007ff609f7bd19 in method_vectorcall (method=<optimized out>, args=0x7ff609818088, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#21 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c7098c0, throwflag=<optimized out>) at Objects/call.c:255
#22 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#23 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff18132ac68, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#24 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1838e4780, throwflag=<optimized out>) at Objects/call.c:255
#25 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#26 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff55c4b6e58, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#27 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c709000, throwflag=<optimized out>) at Objects/call.c:255
#28 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#29 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff55c7133d8, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#30 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1811c0e40, throwflag=<optimized out>) at Objects/call.c:255
#31 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#32 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c5009a0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#33 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#34 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff18113ca60, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#35 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#36 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff181149120, throwflag=<optimized out>) at Objects/call.c:255
#37 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#38 0x00007ff609f7bd19 in method_vectorcall (method=<optimized out>, args=0x7ff609818088, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#39 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff18113c8b0, throwflag=<optimized out>) at Objects/call.c:255
#40 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#41 0x00007ff60a070cc6 in _PyObject_Call_Prepend (tstate=0x55a7de58e3c0, callable=0x7ff1cb5c5d80, obj=<optimized out>, args=<optimized out>, kwargs=0x0)
at Objects/call.c:142
#42 0x00007ff60a0a5e8a in slot_tp_call (self=0x7ff181575870, args=0x7ff609818070, kwds=0x0) at Objects/typeobject.c:7494
#43 0x00007ff609fff186 in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff449e0, oparg=<optimized out>, kwnames=<optimized out>)
at Objects/call.c:215
#44 0x00007ff609fd4c27 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a686700, throwflag=<optimized out>) at Python/ceval.c:4213
#45 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#46 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff180f3ad58, nargsf=<optimized out>, kwnames=0x7ff55c482480)
at ./Include/cpython/abstract.h:114
#47 0x00007ff609ffec72 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a68e480, throwflag=<optimized out>) at Objects/call.c:267
#48 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#49 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff18107cfd8, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#50 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff181148f40, throwflag=<optimized out>) at Objects/call.c:255
#51 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#52 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff18134e5d8, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#53 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6bf4c0, throwflag=<optimized out>) at Objects/call.c:255
#54 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#55 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff13a6bf490, nargsf=<optimized out>, kwnames=0x7ff1cb63bc70)
at ./Include/cpython/abstract.h:114
#56 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff45aa8, oparg=<optimized out>, kwnames=0x0)
at ./Include/cpython/abstract.h:114
#57 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6bf300, throwflag=<optimized out>) at Python/ceval.c:4231
#58 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#59 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff181677d98, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#60 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff55c376e40, throwflag=<optimized out>) at Objects/call.c:255
#61 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#62 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a686510, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#63 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#64 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x55a826d37fd0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#65 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#66 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x55a8af244200, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#67 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#68 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff181148040, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#69 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#70 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6ebe20, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#71 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#72 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a629e00, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#73 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#74 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff13a6298c0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#75 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#76 0x00007ff609fd8d39 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff19f75ed40, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#77 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#78 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff19f75ecf8, nargsf=<optimized out>, kwnames=0x7ff1cb693130)
at ./Include/cpython/abstract.h:114
#79 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff47fd8, oparg=<optimized out>, kwnames=0x0)
at ./Include/cpython/abstract.h:114
#80 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff19f75eb60, throwflag=<optimized out>) at Python/ceval.c:4231
#81 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#82 0x00007ff609f7bca0 in method_vectorcall (method=<optimized out>, args=0x7ff181683578, nargsf=<optimized out>, kwnames=0x0) at ./Include/cpython/abstract.h:114
#83 0x00007ff609fd8761 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1812fea40, throwflag=<optimized out>) at Objects/call.c:255
#84 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#85 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x7ff1814693f8, nargsf=<optimized out>, kwnames=0x7ff183833310)
at ./Include/cpython/abstract.h:114
#86 0x00007ff609ffec72 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff183967440, throwflag=<optimized out>) at Objects/call.c:267
#87 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#88 0x00007ff609fdaf2b in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff1837c2cd0, throwflag=<optimized out>) at ./Include/cpython/abstract.h:114
#89 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#90 0x00007ff609f7bbeb in method_vectorcall (method=<optimized out>, args=0x55a7ece04930, nargsf=<optimized out>, kwnames=0x7ff60951c9c0)
at ./Include/cpython/abstract.h:114
#91 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff49028, oparg=<optimized out>, kwnames=0x0)
at ./Include/cpython/abstract.h:114
#92 0x00007ff609fd42a2 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x55a7ece04660, throwflag=<optimized out>) at Python/ceval.c:4231
#93 0x00007ff609f797dc in _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
at ./Include/internal/pycore_ceval.h:46
#94 0x00007ff609ffefcc in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffe6ff49450, oparg=<optimized out>, kwnames=0x0)
at ./Include/cpython/abstract.h:114
#95 0x00007ff609fd4c27 in _PyEval_EvalFrameDefault (tstate=0x55a7de58e3c0, f=0x7ff6097f4440, throwflag=<optimized out>) at Python/ceval.c:4213
#96 0x00007ff60a0d73d3 in _PyEval_EvalFrame (tstate=0x55a7de58e3c0, f=0x7ff6097f4440, throwflag=0) at ./Include/internal/pycore_ceval.h:46
#97 _PyEval_Vector (tstate=0x55a7de58e3c0, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>)
at Python/ceval.c:5067
#98 0x00007ff60a10de72 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ff6097534c0, locals=<optimized out>, flags=<optimized out>,
arena=<optimized out>) at Python/ceval.c:1134
#99 0x00007ff609eb2f0f in pyrun_file (fp=0x55a7de592fb0, filename=0x7ff6096361f0, start=<optimized out>, globals=0x7ff6097534c0, locals=0x7ff6097534c0, closeit=1,
flags=<optimized out>) at Python/pythonrun.c:1208
#100 0x00007ff609eb2a30 in _PyRun_SimpleFileObject (fp=0x55a7de592fb0, filename=0x7ff6096361f0, closeit=1, flags=0x7ffe6ff496d0) at Python/pythonrun.c:456
#101 0x00007ff609eb23fd in _PyRun_AnyFileObject (fp=0x55a7de592fb0, filename=0x7ff6096361f0, closeit=1, flags=0x7ffe6ff496d0) at Python/pythonrun.c:90
#102 0x00007ff609ebbfa9 in pymain_run_file_obj (program_name=0x7ff6096368d0, filename=0x7ff6096361f0, skip_source_first_line=0) at Modules/main.c:353
#103 0x00007ff609ebbb12 in pymain_run_file (config=0x55a7de571bc0) at Modules/main.c:372
#104 0x00007ff60a11a3f3 in Py_RunMain () at Modules/main.c:587
#105 0x00007ff609ebc095 in pymain_main (args=<optimized out>) at Modules/main.c:696
#106 0x00007ff609ebc347 in Py_BytesMain (argc=<optimized out>, argv=0x80) at Modules/main.c:720
#107 0x00007ff60996d083 in __libc_start_main (main=0x55a7dcd49060 <main>, argc=12, argv=0x7ffe6ff49958, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7ffe6ff49948) at ../csu/libc-start.c:308
#108 0x000055a7dcd4908e in _start ()
When I check the pt_autograd_{rank} threads, and more specifically the one belonging to the hanging rank, I find:
(gdb) thread 146
(gdb) bt
#0 0x00007ff609a4b71b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
#1 0x00007ff5638c67fc in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ff563c31b05 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ff563b677c4 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ff563b683c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007ff563c2c1e0 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007ff563a518e2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007ff563a52ee4 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8 0x00007ff563a9c3c2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9 0x00007ff56397b650 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007ff563c293ae in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#11 0x00007ff563889746 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#12 0x00007ff563889c60 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#13 0x00007ff56388ad77 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007ff563a341a1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007ff608f43441 in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#16 0x00007ff608f166fd in ?? () from /usr/local/cuda/lib64/libcudart.so.11.0
#17 0x00007ff608f686a5 in cudaMemcpyAsync () from /usr/local/cuda/lib64/libcudart.so.11.0
#18 0x00007ff5b990a86b in at::native::copy_device_to_device(at::TensorIterator&, bool, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007ff5b990cf22 in at::native::copy_kernel_cuda(at::TensorIterator&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007ff5f4641a1e in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007ff5f464331a in at::native::copy_(at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007ff5f548751f in at::_ops::copy_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#23 0x00007ff5f8dc2935 in torch::ADInplaceOrView::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007ff5f548751f in at::_ops::copy_::redispatch(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007ff5f8dc3d20 in torch::autograd::VariableType::(anonymous namespace)::copy_(c10::DispatchKeySet, at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007ff5f54f970f in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007ff5f8ebe680 in std::_Function_handler<bool (at::Tensor&), c10d::Reducer::copy_bucket_to_grad(at::Tensor&, c10d::Reducer::Bucket&, unsigned long, bool)::{lambda(auto:1&)#1}>::_M_invoke(std::_Any_data const&, at::Tensor&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007ff5f8ebd9b3 in c10d::Reducer::copy_bucket_to_grad(at::Tensor&, c10d::Reducer::Bucket&, unsigned long, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#29 0x00007ff5f8ecdb80 in c10d::Reducer::finalize_bucket_dense(c10d::Reducer::Bucket&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#30 0x00007ff5f8ece254 in c10d::Reducer::finalize_backward() ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#31 0x00007ff5f8ecec69 in std::_Function_handler<void (), c10d::Reducer::mark_variable_ready(unsigned long)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#32 0x00007ff5f83eefd0 in torch::autograd::GraphTask::exec_post_processing() ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#33 0x00007ff5f83f14ff in torch::autograd::GraphTask::mark_as_completed_and_run_post_processing() ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#34 0x00007ff5f83f60ba in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#35 0x00007ff5f83efe94 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_cpu.so
#36 0x00007ff60837de70 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
from /workdir/train_docker.binary.runfiles/pip-custom_torch/site-packages/torch/lib/libtorch_python.so
#37 0x00007ff5b7a57df4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#38 0x00007ff609ca9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#39 0x00007ff609a68353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The call to cudaMemcpyAsync hangs for some reason. Eventually this rank can't join the next collective, and an NCCL timeout arises.
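In case it helps, this is what I plan to set before the next hanging run to collect more diagnostics. These are standard PyTorch/NCCL debug environment variables, not specific to my setup, and they need to be set before the process group is created:

```python
import os

# Extra diagnostics for the next hanging run. These must be set before
# torch.distributed initializes the process group.
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL logs from every rank
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL"          # focus the logs on collectives
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # c10d collective-desync diagnostics
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # tear down on timeout instead of hanging
```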
Unfortunately, I can't provide a reproducible example.
Do you have any idea what could cause this operation to hang, and how I could debug it further?