Timeout in distributed training

DDP Error: Timeout during `_sync_buffers` / `_distributed_broadcast_coalesced`

Hi everyone,

I’m encountering an issue while using `torch.nn.parallel.DistributedDataParallel` (DDP) for distributed training and would appreciate some guidance. The error occurs during the buffer-synchronization step at the start of the DDP forward pass.

Problem Description:

During the forward pass of my DDP-wrapped model, I’m getting an error that originates from `torch/nn/parallel/distributed.py`. The traceback from rank 0 (partially shown below) points to an issue in `_sync_buffers` and subsequently in `_distributed_broadcast_coalesced`: rank 0 reports that rank 1 failed to pass the monitored barrier within the timeout.
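For context, my understanding is that DDP re-broadcasts module buffers (e.g. BatchNorm running statistics) from the authoritative rank at the start of every forward pass when `broadcast_buffers=True` (the default), and that broadcast is exactly the `_sync_buffers()` call in the traceback. A minimal, self-contained sketch of that knob (single-process gloo group just so it runs anywhere; the toy model below is hypothetical — my real model is a NeMo ASR model):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group purely to make the sketch runnable on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Hypothetical toy model with a buffer-carrying layer (BatchNorm).
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))

# broadcast_buffers=False skips the per-forward _sync_buffers()
# broadcast seen in the traceback; buffers then stay rank-local.
ddp_model = DDP(model, broadcast_buffers=False)

out = ddp_model(torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 8])
dist.destroy_process_group()
```

If the buffers don’t actually need to stay in sync across ranks, I assume `broadcast_buffers=False` would sidestep this code path entirely, though it changes semantics for BatchNorm-style buffers.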

Partial traceback (from rank 0):

```
  File "/home/ubuntu/nemo-asr-finetuning/nemo_asr_finetuning/speech_to_text_prenet.py", line 214, in <module>
    main()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/ubuntu/nemo-asr-finetuning/nemo_asr_finetuning/speech_to_text_prenet.py", line 211, in main
    trainer.fit(asr_model)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 153, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 137, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
    out = func(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/optim/adamw.py", line 197, in step
    loss = closure()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 389, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 640, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1639, in forward
    inputs, kwargs = self._pre_forward(*inputs, **kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1535, in _pre_forward
    self._sync_buffers()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2172, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2176, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2198, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(bufs, bucket_size, authoritative_rank)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-Q73X3XYt-py3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 2113, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: [Rank 0]: Ranks 1 failed to pass monitoredBarrier in 1800000 ms
```
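One thing I noticed: the 1800000 ms in the error is exactly the default 30-minute process-group timeout, so the barrier seems to be hitting the default rather than a custom setting. A sketch of how I believe the timeout can be raised (the Lightning part is commented out and based on `DDPStrategy`'s `timeout` argument; my actual trainer setup differs):

```python
from datetime import timedelta

# 1800000 ms in the error message equals the default 30-minute timeout.
assert timedelta(milliseconds=1_800_000) == timedelta(minutes=30)

# Raising it when creating the process group directly:
#   torch.distributed.init_process_group("nccl", timeout=timedelta(hours=2))
#
# Or via PyTorch Lightning (kwarg from DDPStrategy's signature; the rest
# of the Trainer arguments here are hypothetical):
#   from lightning.pytorch.strategies import DDPStrategy
#   trainer = Trainer(strategy=DDPStrategy(timeout=timedelta(hours=2)))

print("default monitored-barrier timeout:", timedelta(milliseconds=1_800_000))
```

That said, I realize a longer timeout would only help if rank 1 is genuinely slow rather than deadlocked.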

Environment:

  • PyTorch Version: 2.6.0+cu118
  • Backend: nccl

  • OS: Ubuntu 20.04
  • Python Version: 3.10
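Before re-running, I’m planning to enable extra distributed debug logging so each rank reports which collective it is stuck in (env var names from the PyTorch and NCCL documentation; they must be set before the process group is created):

```python
import os

# Desync/collective detail logging from torch.distributed.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
# NCCL-level logging, to see the last collective each rank entered.
os.environ["NCCL_DEBUG"] = "INFO"

print(os.environ["TORCH_DISTRIBUTED_DEBUG"], os.environ["NCCL_DEBUG"])
```

Any pointers on other things to check (dataloader length mismatches between ranks, conditional forward paths, etc.) would be much appreciated.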