loss.backward() randomly hangs when using DDP

I’m having an issue where my code randomly hangs at loss.backward() when using DistributedDataParallel.
The hang occurs at a completely random point, and all GPUs sit at 100% utilization.
It looks like a synchronization problem, but I cannot find the specific cause.
I have checked that all parameters in the model are used and that there are no conditional branches in the model.

I am really confused; any help would be appreciated.
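For anyone debugging a similar hang, here is a minimal diagnostic sketch (not taken from the actual training code in this thread; the function and variable names are hypothetical). It enables the extra DDP/NCCL logging that usually narrows a desync down to a specific rank or parameter, assuming a PyTorch release new enough to support TORCH_DISTRIBUTED_DEBUG:

```python
# Diagnostic sketch only, assuming a recent PyTorch release.
# Launch with extra NCCL logging, e.g.:
#   NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL python -m torch.distributed.launch ...
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def build_ddp_model(model: torch.nn.Module) -> DDP:
    # TORCH_DISTRIBUTED_DEBUG=DETAIL makes DDP check that all ranks reduce the
    # same parameters in the same order and report any mismatch.
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)

    # find_unused_parameters=True is only a diagnostic workaround: if the hang
    # disappears with it, some parameters do not receive gradients on every
    # iteration even though they appeared to be "used".
    return DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```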

torch 1.7.1+cu110
torchvision 0.8.2+cu110

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

You are using quite an old PyTorch release, so I would recommend updating to the latest one and checking whether your code still randomly hangs.

Thanks for your reply. I upgraded PyTorch to 1.13.0 and CUDA to 11.7, but I still have the problem.

[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808635 milliseconds before timing out.                     
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808721 milliseconds before timing out.                     
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808928 milliseconds before timing out.                     
Traceback (most recent call last):                                                                                                                                                                                      
  File "tools/train.py", line 220, in <module>                                                                                                                                                                          
    main()                                                                                                                                                                                                              
  File "tools/train.py", line 193, in main                                                                                                                                                                              
    logger=logger                                                                                                                                                                                                       
  File "/root/DSGN2/tools/train_utils/train_utils.py", line 141, in train_model                                                                                                                                         
    logger=logger                                                                                                                                                                                                       
  File "/root/DSGN2/tools/train_utils/train_utils.py", line 54, in train_one_epoch                                                                                                                                      
    loss.backward()                                                                                                                                                                                                     
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/_tensor.py", line 488, in backward                                                                                                                 
    self, gradient, retain_graph, create_graph, inputs=inputs                                                                                                                                                           
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward                                                                                                       
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass                                                                                                                 
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply                                                                                                          
    return user_fn(self, *args)                                                                                                                                                                                         
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 131, in backward                                                                                                   
    combined, torch.distributed.ReduceOp.SUM, process_group, async_op=False)                                                                                                                                            
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce                                                                                         
    work = group.allreduce([tensor], opts)                                                                                                                                                                              
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 
1808622 milliseconds before timing out.                                                                                                                                                                                 
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.                           
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401125 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401126 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401127 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 401128) of binary: /root/miniconda3/envs/dsgn/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/dsgn/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
tools/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-17_21:32:01
  host      : autodl-container-e12911b43c-3e093856
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 401128)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 401128
=======================================================

If I do not use sync_bn, it works well :sweat_smile:
However, the performance drops if I do so.

@ptrblck
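For context, a hedged sketch of the SyncBatchNorm setup assumed here (the actual DSGN2 code may differ). The traceback above points at SyncBatchNorm's backward all_reduce in torch/nn/modules/_functions.py, so the conversion step below is the relevant part:

```python
# Sketch of a typical sync_bn + DDP setup, not the project's exact code.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_with_sync_bn(model: torch.nn.Module, local_rank: int) -> DDP:
    # Every rank must reach every SyncBatchNorm layer on every iteration.
    # If one rank skips a forward/backward pass (e.g. an uneven last batch or
    # a data-dependent early exit), the other ranks wait in SyncBatchNorm's
    # all_reduce until the NCCL watchdog timeout fires, as in the log above.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```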

1.13.0 is still an older release and won’t be fixed anymore, so you would still need to check the latest stable and/or nightly release.