I am trying to fine-tune a BLIP-2 model on a custom dataset. The model trains for one epoch, but at the first step of the second epoch I get the following error. Inside the training file the batch size is 6.
error at step 1 in epoch 2: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Traceback (most recent call last):
  File "/workspace/train.py", line 518, in <module>
    outputs = combined_model(**batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1899, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 762, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/train.py", line 257, in forward
    outputs = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py", line 1011, in forward
    outputs = self.model.decoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py", line 777, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py", line 418, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py", line 192, in forward
    attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/workspace/train.py", line 537, in <module>
    torch.cuda.empty_cache() # Free memory if an error occurs
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 133, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at …/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d5e052a34d7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7d5e0526d36b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7d5e0533fb58 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1c36b (0x7d5e0531036b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2b930 (0x7d5e0531f930 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d5a16 (0x7d5dfe978a16 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7d5e05288e77 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7d5e0528169e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7d5e052817b9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x75afc8 (0x7d5dfebfdfc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x12f715 (0x6094db5da715 in /usr/bin/python)
frame #11: <unknown function> + 0x268a60 (0x6094db713a60 in /usr/bin/python)
frame #12: Py_FinalizeEx + 0x148 (0x6094db70fb98 in /usr/bin/python)
frame #13: Py_RunMain + 0x173 (0x6094db7012d3 in /usr/bin/python)
frame #14: Py_BytesMain + 0x2d (0x6094db6d7cad in /usr/bin/python)
frame #15: <unknown function> + 0x29d90 (0x7d5e2fd71d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: __libc_start_main + 0x80 (0x7d5e2fd71e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #17: _start + 0x25 (0x6094db6d7ba5 in /usr/bin/python)
[2024-09-08 10:40:37,800] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2122
[2024-09-08 10:40:37,801] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed_config', 'ds_config.json'] exits with return code = -6
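For reference, the error message suggests re-running with synchronous kernel launches so the stack trace points at the op that actually fails. A minimal sketch of doing that from inside the script, assuming the environment variable is set before torch is imported (e.g. at the very top of train.py):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the CUDA context is created

import torch  # imported afterwards, so kernel launches are synchronous and the error surfaces at the real call site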
To train the model I used the following DeepSpeed configuration, with the initial fp16 loss scale reduced to 1024:
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-3,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 1024
  }
}
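The relevant part of train.py is structured roughly like the sketch below (reconstructed from the traceback; combined_model is the DeepSpeed engine around the PEFT-wrapped BLIP-2 model, while the backward/step calls and the loop variables are illustrative rather than my exact code):

import torch

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        try:
            outputs = combined_model(**batch)      # train.py:518 in the traceback
            combined_model.backward(outputs.loss)  # standard DeepSpeed engine backward
            combined_model.step()
        except RuntimeError as e:
            print(f"error at step {step} in epoch {epoch}: {e}")
            # train.py:537 -- after an illegal memory access the CUDA context is already
            # corrupted, so this call raises the same error again, which is what the
            # second traceback shows.
            torch.cuda.empty_cache()  # Free memory if an error occurs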