RuntimeError: NCCL communicator was aborted on rank 0 when training on multiple GPUs

Background:
Training runs fine on a single GPU, but with 2 or more GPUs it crashes, and the error always appears at the end of an epoch.
The error on the master node occurs at default_pg.allreduce, while the error messages on the worker nodes are not consistent: sometimes the error occurs at norm_layer in swin_transformer, sometimes at fc, and sometimes at attn_mask.masked_fill.
Here is the training command:

python3.7 -m torch.distributed.launch \
     --nproc_per_node=2 \
     --nnodes=1 \
       main.py \
       --data_root /mnt/common/Next-Generation-OCR/dataset/src/ \
       --anno_root /mnt/common/Next-Generation-OCR/dataset/src/ \
       --output_folder /mnt/common/wangpeng/research/exp/sptsOCR/debug/ \
       --train_dataset ctw1500  \
       --lr 0.0005 \
       --max_steps 1000000 \
       --warmup_steps 5000 \
       --checkpoint_freq 20 \
       --checkpoint_freq_step 20000 \
       --batch_size 1 \
       --tfm_pre_norm \
       --train_max_size 768 \
       --rec_loss_weight 2 \
       --num_workers 0 \
       --image \
       --seed 253 \
       --max_prompt 8 \
       --granularity word

Here is the master error:

Epoch: [13] Total time: 0:05:03 (1.2123 s / it)
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=450000) ran for 459084 milliseconds before timing out.
Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main(args)
  File "main.py", line 73, in main
    train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 101, in train_one_epoch
    metric_logger.synchronize_between_processes()
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/utils/logger.py", line 100, in synchronize_between_processes
    meter.synchronize_between_processes()
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/utils/logger.py", line 34, in synchronize_between_processes
    dist.all_reduce(t)
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1285, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=450000) ran for 459084 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=450000) ran for 459084 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 299) of binary: /home/pai/bin/python3
Traceback (most recent call last):
  File "/home/pai/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/pai/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pai/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Here is the worker error:

/workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [2,0,0], thread: [29,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [2,0,0], thread: [30,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [2,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

damo-0138:295:427 [0] init.cc:988 NCCL WARN Cuda failure 'device-side assert triggered'
Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main(args)
  File "main.py", line 73, in main
    train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 38, in train_one_epoch
    outputs = model(samples, input_seqs_, mask_prompts)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/spts.py", line 16, in forward
    features, pos = self.backbone(samples)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/joiner.py", line 11, in forward
    xs = self[0](tensor_list)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 618, in forward
    x_out = norm_layer(x_out)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    input, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/functional.py", line 2347, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x60 (0x7f6465886750 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1df9e (0x7f64ab98bf9e in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1bb (0x7f64ab98d26b in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f646586d8a4 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x296a19 (0x7f64b8f5fa19 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xb1e385 (0x7f64b97e7385 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2c9 (0x7f64b97e76d9 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xf2ec7 (0x560870d7eec7 in /home/pai/bin/python3)
frame #8: <unknown function> + 0xf2ec7 (0x560870d7eec7 in /home/pai/bin/python3)
frame #9: <unknown function> + 0xf2787 (0x560870d7e787 in /home/pai/bin/python3)
frame #10: <unknown function> + 0xf2617 (0x560870d7e617 in /home/pai/bin/python3)
frame #11: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #12: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #13: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #14: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #15: PyDict_SetItem + 0x3da (0x560870dc54ba in /home/pai/bin/python3)
frame #16: PyDict_SetItemString + 0x4f (0x560870dcc4df in /home/pai/bin/python3)
frame #17: PyImport_Cleanup + 0x99 (0x560870e31d49 in /home/pai/bin/python3)
frame #18: Py_FinalizeEx + 0x61 (0x560870e9c061 in /home/pai/bin/python3)
frame #19: Py_Main + 0x35e (0x560870ea63ae in /home/pai/bin/python3)
frame #20: main + 0xee (0x560870d7043e in /home/pai/bin/python3)
frame #21: __libc_start_main + 0xe7 (0x7f64edca8bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1c3d0b (0x560870e4fd0b in /home/pai/bin/python3)

Sometimes the worker error looks like this instead:

dsw64682-57b6f9d7bc-lr6zq:16471:16480 [1] init.cc:924 NCCL WARN Cuda failure 'device-side assert triggered'
Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main(args)
  File "main.py", line 73, in main
    train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 38, in train_one_epoch
    outputs = model(samples, input_seqs_, mask_prompts)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted.
    result = self.forward(*input, **kwargs)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
[E ProcessGroupNCCL.cpp:294] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/spts.py", line 16, in forward
    features, pos = self.backbone(samples)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/joiner.py", line 11, in forward
    xs = self[0](tensor_list)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 614, in forward
    x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 387, in forward
    attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))

or:

dsw64682-57b6f9d7bc-lr6zq:26670:26678 [1] init.cc:924 NCCL WARN Cuda failure 'device-side assert triggered'
    main(args)
  File "main.py", line 73, in main
    train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 38, in train_one_epoch
    outputs = model(samples, input_seqs_, mask_prompts)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted.
    result = self.forward(*input, **kwargs)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/spts.py", line 16, in forward
    features, pos = self.backbone(samples)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/joiner.py", line 11, in forward
    xs = self[0](tensor_list)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 614, in forward
    x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 394, in forward
    x = blk(x, attn_mask)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 251, in forward
    x = x + self.drop_path(self.mlp(self.norm2(x)))
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 31, in forward
    x = self.fc1(x)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Killing subprocess 26669
Killing subprocess 26670
Traceback (most recent call last):
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/mnt/common/wangpeng/conda/envs/uniocr/bin/python3.7', '-u', 'main.py', '--local_rank=1', '--data_root', '/mnt/common/Next-Generation-OCR/dataset/src/', '--anno_root', '/mnt/common/Next-Generation-OCR/dataset/src/', '--output_folder', '/mnt/common/wangpeng/research/exp/sptsOCR/debug/', '--train_dataset', 'ic15', '--lr', '0.0005', '--max_steps', '1000000', '--warmup_steps', '5000', '--checkpoint_freq', '20', '--checkpoint_freq_step', '20000', '--batch_size', '1', '--tfm_pre_norm', '--train_max_size', '768', '--rec_loss_weight', '2', '--num_workers', '0', '--image', '--seed', '253', '--max_prompt', '8', '--granularity', 'word']' returned non-zero exit status 1.

An indexing operation failed; NCCL is just the victim here, not the cause, since it is only re-raising the error after the device-side assert killed the worker. Embedding layers are a common source of such out-of-bounds indices, so if your model has any, I would start by checking the indices you feed them.
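
As a starting point, a minimal sanity check could look like the sketch below. It assumes your sequence tensors (e.g. the input_seqs_ passed to the model) are integer ids consumed by an nn.Embedding or by other indexing ops; the attribute names in the usage comment are placeholders for whatever your model actually uses. Running with CUDA_LAUNCH_BLOCKING=1, as the worker log itself suggests, will also make the failure surface at the real call site instead of at a later collective.

import os
import torch

# Must be set before CUDA is initialized, e.g. at the top of main.py or in
# the launch environment; makes kernels report errors at the call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def check_indices(indices: torch.Tensor, num_embeddings: int, name: str = "indices") -> None:
    # Raise a readable CPU-side error before the device-side assert fires.
    if indices.numel() == 0:
        return
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= num_embeddings:
        raise ValueError(
            f"{name}: expected values in [0, {num_embeddings - 1}], "
            f"got min={lo}, max={hi}"
        )

# Hypothetical usage inside train_one_epoch, before the forward pass;
# `input_seqs_` and `embedding` stand in for your actual names:
# check_indices(input_seqs_, model.module.embedding.num_embeddings, "input_seqs_")

If the check fires, the reported min/max usually points straight at the offending sample or at a vocabulary-size mismatch between the dataset annotations and the model.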