Hi all,
I’m encountering a strange but fully reproducible issue while migrating a training script from DataParallel (DP) to DistributedDataParallel (DDP). A minimal working example (MWE) runs successfully under DDP, and the full training code runs correctly under DP. However, when I wrap my full model in DDP, the script fails or hangs.
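For context, the MWE that does work under DDP is just the standard wrapping pattern. A minimal CPU-only sketch (single-process gloo group purely for illustration; the real runs use torchrun with NCCL across GPUs, and the toy model/shapes here are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_toy_ddp():
    # Single-process gloo group so the example is self-contained;
    # under torchrun, MASTER_ADDR/PORT, RANK, and WORLD_SIZE are set
    # by the launcher and the backend would be "nccl".
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
    ddp_model = DDP(model)  # no device_ids on CPU
    out = ddp_model(torch.randn(2, 3, 16, 16))
    out.mean().backward()
    dist.destroy_process_group()
    return out.shape

shape = run_toy_ddp()
print(shape)  # torch.Size([2, 8, 16, 16])
```

A toy model like this goes through forward/backward without issue; only my actual generator triggers the failure.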
What’s particularly confusing is that the exact same code runs successfully when I set TORCH_DISTRIBUTED_DEBUG=DETAIL. Without the flag, the script fails every time; with it, the run succeeds every time. The outcome is deterministic, not random.
The failure happens very early: the first print statement inside forward is never reached. To rule out common synchronization issues, I removed the DistributedSampler, added multiple dist.barrier() calls, and verified that the dataloader and input pipeline do not appear to be involved. A toy CNN runs fine under DDP without the debug flag, so the issue seems specific to my actual CNN model.
To debug further, I ran with NCCL_DEBUG=INFO, TORCH_NCCL_ASYNC_ERROR_HANDLING=1, and CUDA_LAUNCH_BLOCKING=1. I also added rank-wise logging to pinpoint where execution stops, but none of this surfaced a clear failure location.
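The rank-wise logging was nothing fancy; roughly this helper (rank_log is my own name, reading RANK as set by torchrun), called before and after every suspect statement:

```python
import os
import datetime

def rank_log(msg):
    # Prefix each line with the rank (set by torchrun) and a timestamp
    # so interleaved stdout can be attributed to a process; flush
    # immediately so the line is not lost if NCCL aborts the job.
    rank = os.environ.get("RANK", "?")
    ts = datetime.datetime.now().strftime("%H:%M:%S.%f")
    line = f"[rank {rank} {ts}] {msg}"
    print(line, flush=True)
    return line
```

Even with this, the last line logged on every rank is immediately before the first netG call, which matches the tracebacks below.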
The most puzzling part is that enabling TORCH_DISTRIBUTED_DEBUG=DETAIL consistently makes the problem disappear, which suggests a timing, synchronization, or model-graph issue that the debug instrumentation happens to mask.
Has anyone seen TORCH_DISTRIBUTED_DEBUG=DETAIL change DDP runtime behavior like this? Are there recommended strategies for isolating the exact model component responsible when the failure does not surface clearly?
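Since both tracebacks below die inside _sync_buffers, one check I plan to try is fingerprinting the buffers DDP will broadcast, per rank, before wrapping the model. A sketch (buffer_signature is a hypothetical helper name, and the toy model here is a placeholder for my generator):

```python
import torch.nn as nn

def buffer_signature(model):
    # Names, shapes, and dtypes of all buffers, in registration order.
    # DDP's coalesced buffer broadcast assumes every rank holds the
    # same set of buffers; logging this per rank before DDP wrapping
    # should reveal any mismatch (e.g. a buffer registered
    # conditionally on only some ranks).
    return [(name, tuple(buf.shape), str(buf.dtype))
            for name, buf in model.named_buffers()]

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
sig = buffer_signature(model)
print(sig)
```

If the signatures match across ranks, temporarily constructing DDP with broadcast_buffers=False would at least confirm whether the buffer sync itself is the trigger.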
Thanks.
Log from rank 0:
[rank0]: Traceback (most recent call last):
[rank0]: File "scripts/main_run.py", line 866, in <module>
[rank0]: main()
[rank0]: File "scripts/main_run.py", line 758, in main
[rank0]: train_largescale_unetgan(args=args, netG=netG, netD=netD, criterion_gan=criterion, criterion_content=criterion, optimizerG=optimizerG, optimizerD=optimizerD, dataset=data_info_dict["train_dataset"], test_dataset=data_info_dict["val_dataset"])
[rank0]: File "scripts/train_eval.py", line 1604, in train_largescale_unetgan
[rank0]: fake, x_recon_lr, M = netG(temp_data)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank0]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in _pre_forward
[rank0]: self._sync_buffers()
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2172, in _sync_buffers
[rank0]: self._sync_module_buffers(authoritative_rank)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2176, in _sync_module_buffers
[rank0]: self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2198, in _default_broadcast_coalesced
[rank0]: self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
[rank0]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2113, in _distributed_broadcast_coalesced
[rank0]: dist._broadcast_coalesced(
[rank0]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
[rank0]: Exception raised from getNcclComm at ../torch/csrc/distributed/c10d/NCCLUtils.cpp:29 (most recent call first):
[rank0]: C++ CapturedTraceback:
[rank0]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank0]: #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
[rank0]: #6 c10d::NCCLComm::getNcclComm() [clone .cold] from NCCLUtils.cpp:0
[rank0]: #7 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
[rank0]: #8 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
[rank0]: #9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank0]: #10 c10::OperatorHandle::redispatchBoxed(c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const from :0
[rank0]: #11 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank0]: #12 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
[rank0]: #13 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
[rank0]: #14 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
[rank0]: #15 c10d::broadcast_coalesced(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::ArrayRef<at::Tensor>, unsigned long, int) from ??:0
[rank0]: #16 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
[rank0]: #17 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
[rank0]: #18 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
[rank0]: #19 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank0]: #20 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank0]: #21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #30 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank0]: #31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank0]: #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #34 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank0]: #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #36 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank0]: #37 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank0]: #38 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank0]: #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank0]: #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank0]: #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank0]: #45 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
[rank0]: #46 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
[rank0]: #47 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
[rank0]: #48 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
[rank0]: #49 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
[rank0]: #50 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
[rank0]: #51 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
[rank0]: #52 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
[rank0]: #53 __libc_start_call_main from ??:0
[rank0]: #54 __libc_start_main_alias_2 from ??:0
[rank0]: #55 _start from ??:0
Log from rank 1:
[rank1]: File "scripts/main_run.py", line 866, in <module>
[rank1]: main()
[rank1]: File "scripts/main_run.py", line 758, in main
[rank1]: train_largescale_unetgan(args=args, netG=netG, netD=netD, criterion_gan=criterion, criterion_content=criterion, optimizerG=optimizerG, optimizerD=optimizerD, dataset=data_info_dict["train_dataset"], test_dataset=data_info_dict["val_dataset"])
[rank1]: File "scripts/train_eval.py", line 1604, in train_largescale_unetgan
[rank1]: fake, x_recon_lr, M = netG(temp_data)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
[rank1]: inputs, kwargs = self._pre_forward(*inputs, **kwargs)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in _pre_forward
[rank1]: self._sync_buffers()
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2172, in _sync_buffers
[rank1]: self._sync_module_buffers(authoritative_rank)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2176, in _sync_module_buffers
[rank1]: self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2198, in _default_broadcast_coalesced
[rank1]: self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
[rank1]: File "s/conda-envs/rlall/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2113, in _distributed_broadcast_coalesced
[rank1]: dist._broadcast_coalesced(
[rank1]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
[rank1]: Exception raised from getNcclComm at ../torch/csrc/distributed/c10d/NCCLUtils.cpp:29 (most recent call first):
[rank1]: C++ CapturedTraceback:
[rank1]: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::string> const> (), c10::SetStackTraceFetcher(std::function<std::string ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
[rank1]: #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
[rank1]: #6 c10d::NCCLComm::getNcclComm() [clone .cold] from NCCLUtils.cpp:0
[rank1]: #7 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
[rank1]: #8 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
[rank1]: #9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
[rank1]: #10 c10::OperatorHandle::redispatchBoxed(c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const from :0
[rank1]: #11 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
[rank1]: #12 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
[rank1]: #13 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
[rank1]: #14 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
[rank1]: #15 c10d::broadcast_coalesced(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::ArrayRef<at::Tensor>, unsigned long, int) from ??:0
[rank1]: #16 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}, void, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int)#98}&&, void (*)(c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, unsigned long, int), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
[rank1]: #17 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
[rank1]: #18 cfunction_call from /usr/local/src/conda/python-3.10.16/Objects/methodobject.c:543
[rank1]: #19 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank1]: #20 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank1]: #21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #23 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #24 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #25 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #26 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #29 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #30 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank1]: #31 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #32 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank1]: #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #34 do_call_core from /usr/local/src/conda/python-3.10.16/Python/ceval.c:5945
[rank1]: #35 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #36 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.16/Objects/call.c:431
[rank1]: #37 slot_tp_call from /usr/local/src/conda/python-3.10.16/Objects/typeobject.c:7494
[rank1]: #38 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.16/Objects/call.c:215
[rank1]: #39 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:112
[rank1]: #40 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #41 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #43 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114
[rank1]: #44 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46
[rank1]: #45 PyEval_EvalCode from /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134
[rank1]: #46 run_eval_code_obj from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291
[rank1]: #47 run_mod from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312
[rank1]: #48 pyrun_file from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1208
[rank1]: #49 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:456
[rank1]: #50 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:90
[rank1]: #51 pymain_run_file_obj from /usr/local/src/conda/python-3.10.16/Modules/main.c:357
[rank1]: #52 Py_BytesMain from /usr/local/src/conda/python-3.10.16/Modules/main.c:1094
[rank1]: #53 __libc_start_call_main from ??:0
[rank1]: #54 __libc_start_main_alias_2 from ??:0
[rank1]: #55 _start from ??:0