RuntimeError: NCCL communicator was aborted

Hi, I am using DDP on a single node with the NCCL backend. After a couple of training epochs I got the following warnings:

[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803308 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803187 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803385 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803386 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803385 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802504 milliseconds before timing out.

and then I got the following traceback on each of the GPUs:

  File "/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/venv/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/venv/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "/venv/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803385 milliseconds before timing out.

I am using torch 1.10.0+cu102 with Python 3.9. Any idea what could be causing this?

This likely indicates that some sort of CUDA/NCCL deadlock caused these timeouts, for example one rank hanging or issuing a collective that the other ranks never reach. There are a few ways to debug this:

  1. Set the environment variable NCCL_DEBUG=INFO; this will print NCCL debugging information.
  2. Set the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL; this adds significant overhead but will report an exact error if there are mismatched collectives across ranks (a minimal way to set both variables is sketched right after this list).
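
Just as a sketch, assuming the job is started with a launcher such as torchrun / torch.distributed.run (so MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already in the environment), both variables can be set from the training script itself; exporting them in the shell before launching every rank is equivalent:

    import os

    # Must be set before the process group / NCCL communicators are created.
    os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # per-collective consistency checks (adds overhead)

    import torch.distributed as dist

    # Relies on the launcher having set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
    dist.init_process_group(backend="nccl")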

You can also try passing broadcast_buffers=False to DDP. Note that this disables the buffer broadcast at the start of each forward pass, so buffers (for example BatchNorm running statistics) are no longer kept in sync across ranks, which might affect model quality if you rely on that synchronization.
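
For example, a minimal sketch with a toy model, assuming the script is launched with torchrun / torch.distributed.run (which sets LOCAL_RANK):

    import os
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / torch.distributed.run
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).to(local_rank)  # placeholder for your real model
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        # Skip the buffer broadcast at the start of each forward pass; buffers
        # such as BatchNorm running statistics are then no longer kept in sync.
        broadcast_buffers=False,
    )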

Thank you @pritamdamania87 and @rvarm1 for your replies.

@pritamdamania87 I am now running with the environment variables you mentioned in order to capture the debug logs. Since the error occurs randomly at some epochs during training, I will have to wait for it to reproduce before I can share them.

@rvarm1 is passing broadcast_buffers=False a standard solution for avoiding such problems?