RuntimeError: NCCL communicator was aborted

Hi, I am using DDP on a single node with the NCCL backend. After a couple of training epochs I got the following warnings:

[E ProcessGroupNCCL.cpp:587] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803308 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803187 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803385 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803386 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803385 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802504 milliseconds before timing out.

and then I got the following traceback on each of the GPUs:

  File "/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/venv/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 878, in forward
    self._sync_params()
  File "/venv/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "/venv/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803385 milliseconds before timing out.

I am using torch 1.10.0+cu102 with Python 3.9. Any idea what could be causing this?

This likely indicates that some sort of CUDA/NCCL deadlock caused these timeouts, for example one rank hanging or issuing a collective that the other ranks never reach. There are a few ways to debug this:

  1. Set the environment variable NCCL_DEBUG=INFO; this will print NCCL debugging information.
  2. Set the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL; this adds significant overhead but will report an exact error if there are mismatched collectives across ranks (a minimal way to set both variables is sketched right after this list).
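
Just as a sketch, assuming the job is started with a launcher such as torchrun / torch.distributed.run (so MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already in the environment), both variables can be set from the training script itself; exporting them in the shell before launching every rank is equivalent:

    import os

    # Must be set before the process group / NCCL communicators are created.
    os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # per-collective consistency checks (adds overhead)

    import torch.distributed as dist

    # Relies on the launcher having set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
    dist.init_process_group(backend="nccl")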

You can also try passing broadcast_buffers=False to DDP. Note that this disables the buffer broadcast at the start of each forward pass, so buffers (for example BatchNorm running statistics) are no longer kept in sync across ranks, which might affect model quality if you rely on that synchronization.
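
For example, a minimal sketch with a toy model, assuming the script is launched with torchrun / torch.distributed.run (which sets LOCAL_RANK):

    import os
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / torch.distributed.run
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).to(local_rank)  # placeholder for your real model
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        # Skip the buffer broadcast at the start of each forward pass; buffers
        # such as BatchNorm running statistics are then no longer kept in sync.
        broadcast_buffers=False,
    )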

Thank you @pritamdamania87 and @rvarm1 for your replies.

@pritamdamania87 I am now running with the environment variables you mentioned in order to capture the debug logs. Since the error occurs randomly at some epochs during training, I will have to wait for it to reproduce before I can share them.

@rvarm1 is passing broadcast_buffers=False a standard solution for avoiding such problems?