Code stuck at cuda.synchronize()

The code hits a hanging issue, and the following stack trace shows it is stuck at cuda.synchronize(). Does anyone have an idea of what is going on?

PyTorch 1.9.0, CUDA 11.1

(gdb) py-bt
Traceback (most recent call first):
  <built-in method _cuda_synchronize of module object at remote 0x7f0b8bc0d360>
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 446, in synchronize
    return torch._C._cuda_synchronize()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/timer.py", line 165, in start
    torch.cuda.synchronize()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2372, in forward
    load_optimizer_states=True,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1563, in _call_impl
    >>> net = nn.Sequential(l, l)
  File "/tmp/code/act/src/qd/opt/trainer.py", line 1531, in inner_loop_step
    loss_dict = self.model_engine(self.dict_data)
  File "/tmp/code/act/src/qd/opt/trainer.py", line 1672, in inner_loop
    losses /= self.gradient_accumulate
  File "/tmp/code/act/src/qd/opt/trainer.py", line 1758, in do
    self.optimizer.zero_grad()
  File "/tmp/code/act/src/qd/pipelines/uni_pipeline.py", line 1743, in do_train
    port=self.cfg.dist_url_tcp_port,
  File "/tmp/code/act/src/qd/pipelines/uni_pipeline.py", line 4443, in train
  File "/tmp/code/act/src/qd/pipelines/uni_pipeline.py", line 1540, in ensure_train
    arguments={'iteration': start_iter},
  File "/tmp/code/act/src/qd/pipeline.py", line 679, in pipeline_train_eval_multi
    pip.ensure_train()
  File "src/qd/qd_common.py", line 3349, in execute_func
  File "src/qd/qd_common.py", line 4338, in <module>
(gdb) 

Looks like there is a deadlock in the code. There could be multiple reasons:

  1. Running concurrent NCCL communications. NCCL only allows one CUDA communicator to access the device at a time. If, say, pipeline parallelism and ZeRO in DeepSpeed launch concurrent comms, it could hang.
  2. A comm deadlock in user code. Suppose there are two processes, X and Y. X calls commA → cuda.synchronize() → commB, while Y calls commB → cuda.synchronize() → commA. This can deadlock, because each process requires the comm op that comes after its cuda.synchronize() to unblock the other (see the sketch after this list).
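
For illustration, here is a minimal sketch of the second pattern. The ranks, tensors, and function name are hypothetical and not taken from the code above; it assumes an NCCL process group has already been initialized.

```python
# Hypothetical sketch of the mismatched-ordering deadlock (reason 2).
# The two ranks issue the same collectives in opposite order around
# cuda.synchronize(), so each synchronize() waits on a collective that
# its peer has not launched yet -> both ranks hang forever.
import torch
import torch.distributed as dist

def buggy_step(rank: int):
    x = torch.ones(1, device="cuda")
    if rank == 0:
        dist.broadcast(x, src=0)   # comm A: needs rank 1 to join
        torch.cuda.synchronize()   # blocks: rank 1 never calls broadcast
        dist.all_reduce(x)         # comm B: never reached
    else:
        dist.all_reduce(x)         # comm B: needs rank 0 to join
        torch.cuda.synchronize()   # blocks: rank 0 never calls all_reduce
        dist.broadcast(x, src=0)   # comm A: never reached
```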

If you didn’t launch comm ops in your user code, could you please create an issue in the DeepSpeed repo and discuss with the DeepSpeed experts?


Is there a built-in mechanism to help debug these communication deadlocks? For example, with thread dumps we can see which thread is waiting on or holding each lock. Or is adding logging the only option?

Maybe setting TORCH_DISTRIBUTED_DEBUG=DETAIL can help?
https://pytorch.org/docs/master/distributed.html#torch-distributed-debug
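
In case it helps, here is a minimal sketch of enabling it from Python. It assumes a standard init_process_group setup; you can also just export the variable in the launch environment. Setting it before process-group creation is my assumption of the safe ordering, since the setting is picked up when the group is initialized.

```python
# Minimal sketch: enable distributed debug logging before creating the
# process group (or export TORCH_DISTRIBUTED_DEBUG in the environment).
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # valid values: OFF (default), INFO, DETAIL

import torch.distributed as dist
dist.init_process_group(backend="nccl")
```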


Thanks Andrew, I didn’t know this existed.