loss.backward() randomly hangs when using DDP

I’m having an issue where my code randomly hangs at loss.backward() when using DistributedDataParallel.
The hang occurs at a completely random point, and all GPUs sit at 100% utilization.
It looks like a synchronization problem, but I cannot find the specific cause.
I have checked that all parameters in the model are used and that there are no conditional branches in the model.

I am really confused; any help would be appreciated.
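For anyone debugging a similar hang, here is a minimal diagnostic sketch (not taken from the actual training code in this thread; the function and variable names are hypothetical). It enables the extra DDP/NCCL logging that usually narrows a desync down to a specific rank or parameter, assuming a PyTorch release new enough to support TORCH_DISTRIBUTED_DEBUG:

```python
# Diagnostic sketch only, assuming a recent PyTorch release.
# Launch with extra NCCL logging, e.g.:
#   NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL python -m torch.distributed.launch ...
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def build_ddp_model(model: torch.nn.Module) -> DDP:
    # TORCH_DISTRIBUTED_DEBUG=DETAIL makes DDP check that all ranks reduce the
    # same parameters in the same order and report any mismatch.
    os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)

    # find_unused_parameters=True is only a diagnostic workaround: if the hang
    # disappears with it, some parameters do not receive gradients on every
    # iteration even though they appeared to be "used".
    return DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```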

torch 1.7.1+cu110
torchvision 0.8.2+cu110

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

You are using quite an old PyTorch release, so I would recommend updating to the latest one and checking whether your code still randomly hangs.

Thanks for your reply. I upgraded PyTorch to 1.13.0 and CUDA to 11.7, but I still have the problem.

[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808635 milliseconds before timing out.                     
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808721 milliseconds before timing out.                     
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808928 milliseconds before timing out.                     
Traceback (most recent call last):                                                                                                                                                                                      
  File "tools/train.py", line 220, in <module>                                                                                                                                                                          
    main()                                                                                                                                                                                                              
  File "tools/train.py", line 193, in main                                                                                                                                                                              
    logger=logger                                                                                                                                                                                                       
  File "/root/DSGN2/tools/train_utils/train_utils.py", line 141, in train_model                                                                                                                                         
    logger=logger                                                                                                                                                                                                       
  File "/root/DSGN2/tools/train_utils/train_utils.py", line 54, in train_one_epoch                                                                                                                                      
    loss.backward()                                                                                                                                                                                                     
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/_tensor.py", line 488, in backward                                                                                                                 
    self, gradient, retain_graph, create_graph, inputs=inputs                                                                                                                                                           
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward                                                                                                       
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass                                                                                                                 
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply                                                                                                          
    return user_fn(self, *args)                                                                                                                                                                                         
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 131, in backward                                                                                                   
    combined, torch.distributed.ReduceOp.SUM, process_group, async_op=False)                                                                                                                                            
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce                                                                                         
    work = group.allreduce([tensor], opts)                                                                                                                                                                              
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 
1808622 milliseconds before timing out.                                                                                                                                                                                 
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.                           
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401125 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401126 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401127 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 401128) of binary: /root/miniconda3/envs/dsgn/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/dsgn/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
tools/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-17_21:32:01
  host      : autodl-container-e12911b43c-3e093856
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 401128)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 401128
=======================================================

If I do not use sync_bn, it works well :sweat_smile:
However, the performance drops if I do so.

@ptrblck
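For context, a hedged sketch of the SyncBatchNorm setup assumed here (the actual DSGN2 code may differ). The traceback above points at SyncBatchNorm's backward all_reduce in torch/nn/modules/_functions.py, so the conversion step below is the relevant part:

```python
# Sketch of a typical sync_bn + DDP setup, not the project's exact code.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_with_sync_bn(model: torch.nn.Module, local_rank: int) -> DDP:
    # Every rank must reach every SyncBatchNorm layer on every iteration.
    # If one rank skips a forward/backward pass (e.g. an uneven last batch or
    # a data-dependent early exit), the other ranks wait in SyncBatchNorm's
    # all_reduce until the NCCL watchdog timeout fires, as in the log above.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```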

1.13.0 is still an older release and won’t be fixed anymore, so you would still need to check the latest stable and/or nightly release.