Multi-GPU training hangs: Watchdog caught collective operation timeout

Hi, I’m training LLaVA using the repo GitHub - haotian-liu/LLaVA (Visual Instruction Tuning: Large Language-and-Vision Assistant built towards multimodal GPT-4 level capabilities). When I train on my own dataset (roughly 500k samples) with DDP on 8 A100 80G GPUs, the training hangs and gives the following error:

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802713 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803216 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802791 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802786 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15172, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803288 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802877 milliseconds before timing out.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [67,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [68,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [69,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [70,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
Traceback (most recent call last):
  File "/workdir/llava/train/train_mem.py", line 16, in <module>
    train()
  File "/workdir/llava/train/train.py", line 930, in train
    trainer.train()
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/workdir/llava/model/language_model/llava_llama.py", line 75, in forward
    input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
  File "/workdir/llava/model/llava_arch.py", line 119, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images)
  File "/workdir/llava/model/llava_arch.py", line 99, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workdir/llava/model/multimodal_encoder/donut_encoder.py", line 47, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype))
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workdir/llava/model/multimodal_encoder/donut.py", line 107, in forward
    x = self.model.layers(x)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 420, in forward
    x = self.blocks(x)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 310, in forward
    attn_windows = self.attn(x_windows, mask=self.attn_mask)  # num_win*B, window_size*window_size, C
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 216, in forward
    x = (attn @ v).transpose(1, 2).reshape(B_, N, -1)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/workdir/llava/train/train_mem.py", line 16, in <module>
    train()
  File "/workdir/llava/train/train.py", line 930, in train
    trainer.train()
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1853, in backward
    loss.backward(**kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: NCCL communicator was aborted on rank 6.  Original reason for failure was: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
=====================================================

At first I thought some corrupt images were causing the error, because I see a CUDA index error in the message above and the traceback points into the Swin Transformer. However, I checked all images with PIL Image.open and deleted every image that produced a warning; no problems were found and the training still hangs. I also checked the input image tensor sizes and they are correct.
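The image check I mention was along these lines (a minimal sketch; find_bad_images and image_paths are just placeholders for however the file list is produced):

import warnings
from PIL import Image

def find_bad_images(image_paths):
    # Treat PIL warnings as errors and force a full decode, so truncated or
    # corrupt files are flagged instead of only passing the header check.
    bad = []
    for path in image_paths:
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("error")
                with Image.open(path) as img:
                    img.load()
        except Exception as e:
            bad.append((path, repr(e)))
    return bad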
I searched for many suggestions in the community, such as setting the following environment variables:

CUDA_LAUNCH_BLOCKING=1
NCCL_P2P_LEVEL=2
NCCL_P2P_DISABLE=1
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_IB_TIMEOUT=22
NCCL_BLOCKING_WAIT=0
unset LD_LIBRARY_PATH

It still didn’t work.
Then I tried training on 2 GPUs with a per-device batch size of 1 and printed each image path to find the sample where it got stuck. The data looked fine, and when I constructed a dataset containing only those 2 images, training did not get stuck and ran normally.

However, training on a single GPU works fine, and training on other datasets in DDP mode also works fine.
So I think the code is okay and the problem seems to be in the dataset, but since single-GPU training works and this dataset was used to train another model before, the dataset also seems fine.

I also tried adding the following code at the beginning of train.py:

torch.distributed.init_process_group(backend="gloo")

but it just produced the following error message:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3772 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3773 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3774 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3775 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3776 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3779 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 5 (pid: 3777) of binary: /miniconda/envs/llava/bin/python3
Traceback (most recent call last):
  File "/miniconda/envs/llava/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/envs/llava/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
llava/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-21_12:47:26
  host      : psx1kopxqb355ls7-worker-0
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 3777)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3777
=====================================================

I’m so confused and don’t know what to do next.

@ptrblck
Would you mind helping me with this question, please? I really appreciate your expertise on this topic.

Narrow down where the indexing error is raised from and fix it. If you are using any embedding layers, check them first, as all recent errors of this kind were caused by inputs containing invalid values for the embeddings, while the users were usually convinced their data was valid.
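For example, a minimal check could look like this (a sketch only, assuming a Hugging Face-style model that exposes get_input_embeddings() and batches containing input_ids; adapt the names to your setup and check the ids that actually reach the embedding layer):

def check_embedding_inputs(model, batch):
    # Out-of-range token ids lead to device-side indexing asserts when the
    # embedding table is indexed, so validate them on the host first.
    emb = model.get_input_embeddings()
    vocab_size = emb.num_embeddings
    input_ids = batch["input_ids"]
    invalid = (input_ids < 0) | (input_ids >= vocab_size)
    if invalid.any():
        raise ValueError(
            f"Found {invalid.sum().item()} token id(s) outside [0, {vocab_size}), "
            f"e.g. {input_ids[invalid][:10].tolist()}"
        )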

I appreciate your reply; I’ll check my dataset again.

What might be the reason that it works on a single GPU but hits an index error on multiple GPUs?
Do you have any ideas? @ptrblck

The data processing could differ and might create invalid inputs.

Thanks for your reply, I appreciate it.

Hi @ptrblck, sorry to bother you again. I have narrowed the dataset down to 5k images: I printed every image path from the __getitem__ function before the DDP training got stuck and used those paths to build a smaller 5k-image dataset. It still trains fine on a single GPU without any warnings or errors, but it still gets stuck in DDP training, and I can’t narrow the dataset down further because the set of printed image paths doesn’t shrink when I run DDP training again with the 5k dataset. I also wonder whether tensor precision could be causing the problem, given the error I see.
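Roughly, the per-rank logging I mean is something like this (an illustrative sketch; LoggingDataset and the log-file layout are placeholders, not code from the repo):

import os
from torch.utils.data import Dataset

class LoggingDataset(Dataset):
    # Wraps an existing dataset and appends every fetched index to a per-rank
    # log file; the last line in each rank's file points at the sample that
    # rank was fetching when the run got stuck. torchrun sets RANK for us.
    def __init__(self, base_dataset, log_dir="sample_logs"):
        self.base = base_dataset
        os.makedirs(log_dir, exist_ok=True)
        rank = int(os.environ.get("RANK", "0"))
        self.log_path = os.path.join(log_dir, f"rank{rank}.log")

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        with open(self.log_path, "a") as f:
            f.write(f"{idx}\n")
        return self.base[idx]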

Since this project uses flash attention, which requires 16-bit precision, I tried both fp16 and bf16, but neither worked.

In your initial post NCCL re-raised an indexing error followed by a cuBLAS failure. The first device assert is real and should be fixed. In your example I doubt NCCL is at fault, as it just re-raises the sticky error, and since the indexing error was raised first I assume it’s the root cause. Of course, other issues could still be in the code, but I would not focus on other libraries raising errors after the indexing assert.

Hi guyu, would you mind sharing how you fixed it? I am having exactly the same issue but don’t know where it’s coming from. Thanks a lot!!

Check whether the sequence length exceeds the maximum position embedding length after the text prompt embeddings and the image embeddings are concatenated.
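For example, a check along these lines could be placed right after the text and image embeddings are concatenated (a minimal sketch; max_position_embeddings is the usual LLaMA-style config field, and the tensor shapes are assumptions):

def check_sequence_length(config, text_embeds, image_features):
    # text_embeds:    (batch, text_len, hidden)
    # image_features: (batch, num_image_tokens, hidden)
    total_len = text_embeds.shape[1] + image_features.shape[1]
    max_len = config.max_position_embeddings
    if total_len > max_len:
        raise ValueError(
            f"Combined sequence length {total_len} exceeds "
            f"max_position_embeddings ({max_len}); truncate the text prompt "
            f"or reduce the number of image tokens."
        )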