RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=220154, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803360 milliseconds before timing out

I am trying to fine-tune a model on 4 RTX 3090s. Training runs fine with 1/10 of the dataset, but with the full dataset I hit this timeout error during the second training epoch.

The environment is Python 3.8.18, torch 2.0.0, and CUDA 11.7.
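
For context, the model is wrapped in DistributedDataParallel and launched with torchrun. My actual train.py and engine.py are more involved, but the structure is roughly the sketch below (the Linear model, random data, and hyperparameters are just placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Launched with: torchrun --nproc_per_node=4 sketch.py
    dist.init_process_group(backend="nccl")               # default collective timeout: 30 minutes
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)       # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)                   # keeps per-rank batch counts equal
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()                                  # gradient allreduce happens here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()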

I set the following environment variables to get more details:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=INFO

The resulting logs are shown below.


mit119-gpu:661428:661498 [1] NCCL INFO AllReduce: opCount 35bfc sendbuff 0x7fdc0802f800 recvbuff 0x7fdc0802f800 count 311322 datatype 7 op 0 root 0 comm 0x3bb37d80 [nranks=4] stream 0x3bb36de0
mit119-gpu:661428:661498 [1] NCCL INFO AllReduce: opCount 35bfd sendbuff 0x7fdd21bc5000 recvbuff 0x7fdd21bc5000 count 583 datatype 2 op 0 root 0 comm 0x3bb37d80 [nranks=4] stream 0x3bb36de0
mit119-gpu:661427:661493 [0] NCCL INFO AllReduce: opCount 35bfa sendbuff 0x7f2930800000 recvbuff 0x7f2930800000 count 10131620 datatype 7 op 0 root 0 comm 0x3d6ed550 [nranks=4] stream 0x3d6f0b70
mit119-gpu:661427:661493 [0] NCCL INFO AllReduce: opCount 35bfb sendbuff 0x7f2932ea6400 recvbuff 0x7f2932ea6400 count 6636594 datatype 7 op 0 root 0 comm 0x3d6ed550 [nranks=4] stream 0x3d6f0b70
mit119-gpu:661427:661493 [0] NCCL INFO AllReduce: opCount 35bfc sendbuff 0x7f284202f800 recvbuff 0x7f284202f800 count 311322 datatype 7 op 0 root 0 comm 0x3d6ed550 [nranks=4] stream 0x3d6f0b70
mit119-gpu:661427:661493 [0] NCCL INFO AllReduce: opCount 35bfd sendbuff 0x7f295bbc5000 recvbuff 0x7f295bbc5000 count 583 datatype 2 op 0 root 0 comm 0x3d6ed550 [nranks=4] stream 0x3d6f0b70
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=220154, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802170 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=220154, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802194 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=220154, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802194 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=220154, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803360 milliseconds before timing out.
mit119-gpu:661430:661447 [0] NCCL INFO comm 0x4033bbe0 rank 3 nranks 4 cudaDev 3 busId 89000 - Abort COMPLETE
Traceback (most recent call last):   
  File "train.py", line 457, in <module>
    main(args)
  File "train.py", line 424, in main 
    train_stats = train_method(
  File "/home/yutingbai/LaVIN2/engine.py", line 61, in train_one_epoch
    c_loss = model(examples, labels,images=images, prefix_img=prefix_img, prefix_nonimg=prefix_nonimg,img_indicators=indicators, prompt_only_question=prompt_only_question, question_mask=question_mask,)
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1148, in forward
    self._sync_buffers()
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1748, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1752, in _sync_module_buffers
    self._default_broadcast_coalesced(
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1775, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1689, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=220154, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1803360 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 661427 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 661428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 661429 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 661430) of binary: /home/yutingbai/anaconda3/envs/bot/bin/python
Traceback (most recent call last):   
  File "/home/yutingbai/anaconda3/envs/bot/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yutingbai/anaconda3/envs/bot/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure): 
[0]:
  time      : 2024-04-30_13:36:22
  host      : super-SYS-4028GR-TR
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 661430)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 661430

I cannot find anything in the logs that helps me resolve this problem. Any suggestions or ideas?
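
One thing I noticed: the Timeout(ms)=1800000 in the log looks like the default 30-minute collective timeout of the NCCL process group. I could raise it when creating the group, as in the sketch below, but I assume that only buys more time and does not fix a rank that never reaches the allreduce in the first place:

from datetime import timedelta
import torch.distributed as dist

# Sketch: raising the NCCL collective timeout at process-group creation.
# The 1800000 ms in the log corresponds to the default timedelta(minutes=30).
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),
)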