DDP Error: Issue during _sync_buffers / _distributed_broadcast_coalesced
Hi everyone,

I’m encountering an issue while using `torch.nn.parallel.DistributedDataParallel` (DDP) for distributed training and would appreciate some guidance. The error seems to occur during the buffer synchronization step within the DDP forward pass.
Problem Description:
During the forward pass of my DDP-wrapped model, I’m getting an error that originates from `torch/nn/parallel/distributed.py`. The traceback from rank 0 (partially shown below) points towards an issue in `_sync_buffers` and subsequently in `_distributed_broadcast_coalesced`.
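For context, my understanding is that DDP with the default `broadcast_buffers=True` broadcasts all module buffers (e.g. BatchNorm running statistics) from the authoritative rank to the other ranks at the start of every forward pass, which is the `_sync_buffers` / `_distributed_broadcast_coalesced` path in the traceback. Below is a minimal sketch of that mechanism; the model and shapes are placeholders, not my actual NeMo ASR setup (which is wrapped by Lightning’s DDP strategy rather than manually):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    # Assumes launch via torchrun, which sets LOCAL_RANK and the rendezvous env vars.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model with buffers (BatchNorm running stats), standing in
    # for the real ASR model just to show where the buffer sync happens.
    model = torch.nn.Sequential(
        torch.nn.Conv1d(80, 128, kernel_size=3),
        torch.nn.BatchNorm1d(128),  # registers running_mean / running_var buffers
    ).cuda()

    # broadcast_buffers=True is the default: DDP broadcasts every buffer from
    # rank 0 to all other ranks inside _pre_forward -> _sync_buffers ->
    # _distributed_broadcast_coalesced, i.e. the frames in the traceback below.
    ddp_model = DDP(model, device_ids=[local_rank], broadcast_buffers=True)

    x = torch.randn(4, 80, 100, device=f"cuda:{local_rank}")
    out = ddp_model(x)  # buffer sync runs here, before forward itself
    print(out.shape)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```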
Partial Traceback (from rank 0):
```
File "/home/ubuntu/nemo-asr-finetuning/nemo_asr_finetuning/speech_to_text_prenet.py", line 214, in <module>
main()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/nemo/core/config/hydra_runner.py", line 129, in wrapper
_run_hydra(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/ubuntu/nemo-asr-finetuning/nemo_asr_finetuning/speech_to_text_prenet.py", line 211, in main
trainer.fit(asr_model)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
results = self._run_stage()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
self.fit_loop.run()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
self._optimizer_step(batch_idx, closure)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
call._call_lightning_module_hook(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 153, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
return func.__get__(opt, opt.__class__)(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
out = func(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/adamw.py", line 197, in step
loss = closure()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
closure_result = closure()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
step_output = self._step_fn()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 389, in training_step
return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 640, in __call__
wrapper_output = wrapper_module(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
inputs, kwargs = self._pre_forward(*inputs, **kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in _pre_forward
self._sync_buffers()
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2172, in _sync_buffers
self._sync_module_buffers(authoritative_rank)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2176, in _sync_module_buffers
self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2198, in _default_broadcast_coalesced
self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2113, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 1800000 ms
```
Environment:
- PyTorch Version: 2.6.0+cu118
- Backend: nccl
- OS: Ubuntu 20.04
- Python Version: 3.10
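
For reference, 1800000 ms looks like the default 30-minute distributed timeout, so rank 1 apparently never reached the buffer broadcast while rank 0 waited at the barrier. In case it helps the discussion, this is a rough sketch of the knobs I understand are exposed through Lightning’s `DDPStrategy` (raising the collective timeout, or disabling the per-forward buffer broadcast); the values are placeholders, and I haven’t verified that either actually addresses the root cause:

```python
from datetime import timedelta

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Sketch only -- placeholder values, not my actual training config.
strategy = DDPStrategy(
    # Raise the collective/barrier timeout above the 30-minute default.
    timeout=timedelta(hours=1),
    # Extra kwargs are forwarded to torch.nn.parallel.DistributedDataParallel;
    # broadcast_buffers=False skips the _sync_buffers step entirely.
    broadcast_buffers=False,
)

trainer = Trainer(accelerator="gpu", devices=2, strategy=strategy)
```

Has anyone seen this monitoredBarrier failure in `_sync_buffers` before, or have tips on how to find out what rank 1 is doing while rank 0 waits? Any pointers would be much appreciated.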