DDP doesn't run unless TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled

Hi all,

I’m encountering a strange but fully reproducible issue while migrating a training script from DataParallel (DP) to DistributedDataParallel (DDP). A minimal working example (MWE) runs successfully under DDP, and the same training code runs correctly under DP. However, when I switch my full model to DDP, the script fails or hangs.

What’s particularly confusing is that the exact same code runs successfully if I set TORCH_DISTRIBUTED_DEBUG=DETAIL. Without this flag, the script fails every time. This is not random: the presence or absence of the debug flag always determines the outcome.

The failure appears to happen very early: the script does not even reach the first print statement in forward when it is called. To rule out common synchronization issues, I’ve removed the DistributedSampler, added multiple dist.barrier() calls, and verified that the issue does not appear to be related to the dataloader or input pipeline. A toy CNN runs fine under DDP without the debug flag, so the issue seems specific to my actual model.

To debug further, I ran with NCCL_DEBUG=INFO, TORCH_NCCL_ASYNC_ERROR_HANDLING=1, and CUDA_LAUNCH_BLOCKING=1. I also attempted rank-wise logging to pinpoint where execution stops, but this did not surface a clear failure location.
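For reference, the debug environment for these runs looks roughly like the following (the torchrun invocation and world size are just illustrative, since I actually launch with mpirun):

```shell
# Debug environment used for the runs above (values are illustrative)
export NCCL_DEBUG=INFO                     # per-rank NCCL init/abort logging
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1   # surface async NCCL errors as exceptions
export CUDA_LAUNCH_BLOCKING=1              # make CUDA errors synchronous
export TORCH_DISTRIBUTED_DEBUG=DETAIL      # the flag that masks the failure
torchrun --nproc_per_node=2 scripts/main_run.py
```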

The most puzzling aspect is that enabling TORCH_DISTRIBUTED_DEBUG=DETAIL consistently makes the problem disappear, which suggests some form of timing, synchronization, or model-graph-related issue that is being masked by the debug instrumentation.

Has anyone encountered a situation where enabling TORCH_DISTRIBUTED_DEBUG=DETAIL changes DDP runtime behavior like this? Are there recommended strategies to isolate the exact model component responsible when the failure does not surface clearly?
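For context, the rank-wise logging I attempted was along these lines — one log file per rank, so I can diff where each process stops after a hang (the helper name and log directory are my own illustrative choices):

```python
import logging
import os

def make_rank_logger(rank: int, log_dir: str = "ddp_logs") -> logging.Logger:
    """Illustrative helper: write each rank's trace to its own file so the
    output of different processes can be compared side by side."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(f"rank{rank}")
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-entry
        handler = logging.FileHandler(os.path.join(log_dir, f"rank{rank}.log"))
        handler.setFormatter(
            logging.Formatter(f"%(asctime)s [rank {rank}] %(message)s")
        )
        logger.addHandler(handler)
    return logger

# Usage inside the training loop, e.g. around each suspect call:
# log = make_rank_logger(dist.get_rank())
# log.debug("before netG forward")
```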

Thanks.

Log from rank 0:

[rank0]: Traceback (most recent call last):
[rank0]:   File "scripts/main_run.py", line 866, in <module>
[rank0]:     main()
[rank0]:   File "scripts/main_run.py", line 758, in main
[rank0]:     train_largescale_unetgan(args=args, netG=netG, netD=netD, criterion_gan=criterion, criterion_content=criterion, optimizerG=optimizerG, optimizerD=optimizerD, dataset=data_info_dict["train_dataset"], test_dataset=data_info_dict["val_dataset"])
[rank0]:   File "scripts/train_eval.py", line 1604, in train_largescale_unetgan
[rank0]:     fake, x_recon_lr, M = netG(temp_data)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in _pre_forward
[rank0]:     self._sync_buffers()
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2172, in _sync_buffers
[rank0]:     self._sync_module_buffers(authoritative_rank)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2176, in _sync_module_buffers
[rank0]:     self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2198, in _default_broadcast_coalesced
[rank0]:     self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
[rank0]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2113, in _distributed_broadcast_coalesced
[rank0]:     dist._broadcast_coalesced(
[rank0]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
[rank0]: Exception raised from getNcclComm at ../torch/csrc/distributed/c10d/NCCLUtils.cpp:29 (most recent call first):
[rank0]: C++ CapturedTraceback:
[rank0]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank0]: #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
[rank0]: #6 c10d::NCCLComm::getNcclComm() [clone .cold] from NCCLUtils.cpp:0
[rank0]: #7 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
[rank0]: #8 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
[rank0]: #9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank0]: #10 c10::OperatorHandle::redispatchBoxed(c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const from :0
[rank0]: #11 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank0]: #12 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
[rank0]: #13 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
[rank0]: #14 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
[rank0]: #15 c10d::broadcast_coalesced(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::ArrayRef<at::Tensor>, unsigned long, int) from ??:0
[rank0]: #16 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
[rank0]: #17 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
[rank0]: #18 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
[rank0]: #19 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank0]: #20 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank0]: #21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #30 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank0]: #31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank0]: #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #34 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank0]: #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #36 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank0]: #37 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank0]: #38 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank0]: #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank0]: #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #45 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
[rank0]: #46 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
[rank0]: #47 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
[rank0]: #48 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
[rank0]: #49 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
[rank0]: #50 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
[rank0]: #51 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
[rank0]: #52 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
[rank0]: #53 __libc_start_call_main from ??:0
[rank0]: #54 __libc_start_main_alias_2 from ??:0
[rank0]: #55 _start from ??:0

Log from rank 1:

[rank1]:   File "scripts/main_run.py", line 866, in <module>
[rank1]:     main()
[rank1]:   File "scripts/main_run.py", line 758, in main
[rank1]:     train_largescale_unetgan(args=args, netG=netG, netD=netD, criterion_gan=criterion, criterion_content=criterion, optimizerG=optimizerG, optimizerD=optimizerD, dataset=data_info_dict["train_dataset"], test_dataset=data_info_dict["val_dataset"])
[rank1]:   File "scripts/train_eval.py", line 1604, in train_largescale_unetgan
[rank1]:     fake, x_recon_lr, M = netG(temp_data)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank1]:     inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in _pre_forward
[rank1]:     self._sync_buffers()
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2172, in _sync_buffers
[rank1]:     self._sync_module_buffers(authoritative_rank)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2176, in _sync_module_buffers
[rank1]:     self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2198, in _default_broadcast_coalesced
[rank1]:     self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
[rank1]:   File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2113, in _distributed_broadcast_coalesced
[rank1]:     dist._broadcast_coalesced(
[rank1]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1. 
[rank1]: Exception raised from getNcclComm at ../torch/csrc/distributed/c10d/NCCLUtils.cpp:29 (most recent call first):
[rank1]: C++ CapturedTraceback:
[rank1]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank1]: #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
[rank1]: #6 c10d::NCCLComm::getNcclComm() [clone .cold] from NCCLUtils.cpp:0
[rank1]: #7 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
[rank1]: #8 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
[rank1]: #9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank1]: #10 c10::OperatorHandle::redispatchBoxed(c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const from :0
[rank1]: #11 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank1]: #12 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
[rank1]: #13 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
[rank1]: #14 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
[rank1]: #15 c10d::broadcast_coalesced(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::ArrayRef<at::Tensor>, unsigned long, int) from ??:0
[rank1]: #16 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
[rank1]: #17 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
[rank1]: #18 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
[rank1]: #19 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank1]: #20 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank1]: #21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #30 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank1]: #31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank1]: #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #34 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank1]: #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #36 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank1]: #37 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank1]: #38 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank1]: #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank1]: #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #45 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
[rank1]: #46 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
[rank1]: #47 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
[rank1]: #48 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
[rank1]: #49 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
[rank1]: #50 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
[rank1]: #51 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
[rank1]: #52 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
[rank1]: #53 __libc_start_call_main from ??:0
[rank1]: #54 __libc_start_main_alias_2 from ??:0
[rank1]: #55 _start from ??:0

Update: I found the issue. At some point, the VSCode agent inserted a misplaced call to the training script: it was being invoked after _cleanup_distributed() had already run. I verified this by checking dist.is_initialized() in both places: True in main, but False inside the training script.

I still don’t understand why enabling TORCH_DISTRIBUTED_DEBUG=DETAIL allowed the script to run. My guess is that I ended up in an inconsistent state where the model was still wrapped in DDP but the default process group had already been destroyed, if such a state is even possible. Since I launch with mpirun, the ranks may then have effectively behaved as independent processes with no remaining synchronization across them. But I don’t know for sure.
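In case it helps anyone hitting the same thing: a cheap guard that would have caught this immediately is to fail fast, with a readable location string, whenever a function that performs collectives is entered after the process group is gone. A minimal sketch (the helper is hypothetical; in real code you would pass torch.distributed.is_initialized as the first argument):

```python
from typing import Callable

def require_initialized(is_initialized: Callable[[], bool], where: str) -> None:
    """Fail fast with a readable location instead of an opaque NCCL abort.

    Pass torch.distributed.is_initialized as `is_initialized` at the top
    of any function that performs collectives.
    """
    if not is_initialized():
        raise RuntimeError(
            f"default process group is not initialized at: {where} "
            "(was the cleanup/destroy called earlier?)"
        )

# e.g. at the top of the training function:
# require_initialized(dist.is_initialized, "train_largescale_unetgan entry")
```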