Hi, I’m training LLaVA using the haotian-liu/LLaVA repo (Visual Instruction Tuning). When I train on my own dataset (roughly 500k samples) with DDP on 8 A100 80G GPUs, the training hangs and gives the following error:
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802713 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803216 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802791 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802786 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15172, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803288 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15173, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802877 milliseconds before timing out.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [67,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [68,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [69,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [44,0,0], thread: [70,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
...
Traceback (most recent call last):
File "/workdir/llava/train/train_mem.py", line 16, in <module>
train()
File "/workdir/llava/train/train.py", line 930, in train
trainer.train()
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/workdir/llava/model/language_model/llava_llama.py", line 75, in forward
input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
File "/workdir/llava/model/llava_arch.py", line 119, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images)
File "/workdir/llava/model/llava_arch.py", line 99, in encode_images
image_features = self.get_model().get_vision_tower()(images)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workdir/llava/model/multimodal_encoder/donut_encoder.py", line 47, in forward
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype))
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workdir/llava/model/multimodal_encoder/donut.py", line 107, in forward
x = self.model.layers(x)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 420, in forward
x = self.blocks(x)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 310, in forward
attn_windows = self.attn(x_windows, mask=self.attn_mask) # num_win*B, window_size*window_size, C
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/timm/models/swin_transformer.py", line 216, in forward
x = (attn @ v).transpose(1, 2).reshape(B_, N, -1)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
File "/workdir/llava/train/train_mem.py", line 16, in <module>
train()
File "/workdir/llava/train/train.py", line 930, in train
trainer.train()
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
self.accelerator.backward(loss)
File "/miniconda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1853, in backward
loss.backward(**kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: NCCL communicator was aborted on rank 6. Original reason for failure was: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15170, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803156 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
=====================================================
At first I thought some corrupt images caused the error, because the message above contains a CUDA index error and the traceback points into the Swin Transformer. But I checked all images with PIL’s Image.open, deleted every image that raised a warning, and found no problems; the training still got stuck. I also checked the input image tensor sizes and they are correct.
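For reference, the check I ran looked roughly like this (a minimal sketch; the JSON file name and the "image" field are placeholders for my actual annotation format):

```python
import json
import warnings
from PIL import Image

# Sketch of the corrupt-image scan (annotation path / field names are placeholders).
# Turn PIL warnings into exceptions so the offending files are easy to collect and delete.
warnings.simplefilter("error")

with open("my_annotations.json") as f:
    samples = json.load(f)

bad = []
for sample in samples:
    path = sample["image"]
    try:
        with Image.open(path) as img:
            img = img.convert("RGB")   # force a full decode, not just the header
            width, height = img.size   # also sanity-check the reported size
    except Exception as exc:
        bad.append((path, repr(exc)))

print(f"{len(bad)} problematic images out of {len(samples)}")
```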
I searched the community and tried many suggestions, such as setting the following environment variables (see the sketch below for how I set them):
CUDA_LAUNCH_BLOCKING=1
NCCL_P2P_LEVEL=2
NCCL_P2P_DISABLE=1
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_IB_TIMEOUT=22
NCCL_BLOCKING_WAIT=0
unset LD_LIBRARY_PATH
It still didn’t work.
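This is roughly how I set them inside the script (a minimal sketch; they can equally be exported from the shell before launching torchrun, and not all of them were active at the same time):

```python
import os

# Mirror the variables from the list above; this must run before torch/CUDA
# initialization and before the process group is created, otherwise some of
# them are ignored.
for key, value in {
    "CUDA_LAUNCH_BLOCKING": "1",
    "NCCL_P2P_DISABLE": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "ALL",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
    "NCCL_IB_TIMEOUT": "22",
    "NCCL_BLOCKING_WAIT": "0",
}.items():
    os.environ.setdefault(key, value)
```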
Then I tried training with 2 GPUs and a per-device batch size of 1, printing each sample’s image path to find the data where it got stuck. The data looked fine, and when I constructed a dataset containing only those 2 images, the training did not get stuck and worked normally.
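The path printing was just a small hook in the dataset code, roughly like this (a sketch; the "image" field name is specific to my annotations, so treat it as a placeholder):

```python
import torch.distributed as dist

def log_sample(idx, sample):
    """Debug helper called at the top of the dataset's __getitem__ (sketch).
    The last path printed on each rank before the hang points at the suspect sample."""
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    path = sample.get("image", "<no image>")
    print(f"[rank {rank}] idx={idx} image={path}", flush=True)
```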
However, training on a single GPU works fine, and training on other datasets in DDP mode also works fine.
So the code seems OK, which points to a problem in the dataset; but since single-GPU training works and this dataset was previously used to train another model, the dataset also seems fine.
I also tried adding the following code at the beginning of train.py:
torch.distributed.init_process_group(backend="gloo")
but I just got this error message:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3772 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3773 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3774 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3775 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3776 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3778 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3779 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 5 (pid: 3777) of binary: /miniconda/envs/llava/bin/python3
Traceback (most recent call last):
File "/miniconda/envs/llava/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/miniconda/envs/llava/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/miniconda/envs/llava/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
llava/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-21_12:47:26
host : psx1kopxqb355ls7-worker-0
rank : 5 (local_rank: 5)
exitcode : -6 (pid: 3777)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3777
=====================================================
I’m really confused and don’t know what to try next.