: CUDA error: an illegal memory access was encountered

I tried to fine-tune a BLIP-2 model on a custom dataset. I can train the model for one epoch, but after that I get the error below. Inside the training script the batch size is 6.
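For context, the relevant part of my training loop looks roughly like this (a simplified sketch: train_loader, num_epochs, and the loss handling are stand-ins for what train.py actually does, and combined_model is the DeepSpeed engine around the PEFT-wrapped BLIP-2 model):

import torch

for epoch in range(1, num_epochs + 1):
    for step, batch in enumerate(train_loader, start=1):
        try:
            batch = {k: v.to(combined_model.device) for k, v in batch.items()}
            outputs = combined_model(**batch)    # train.py line 518 in the traceback
            loss = outputs.loss
            combined_model.backward(loss)        # DeepSpeed engine handles backward and step
            combined_model.step()
        except RuntimeError as e:
            print(f"error at step {step} in epoch {epoch}: {e}")
            torch.cuda.empty_cache()             # train.py line 537; fails too once the CUDA context is corrupted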

error at step 1 in epoch 2: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Traceback (most recent call last):
File "/workspace/train.py", line 518, in <module>
outputs = combined_model(**batch)
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py”, line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py”, line 1899, in forward
loss = self.module(*inputs, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/peft/peft_model.py”, line 762, in forward
return self.get_base_model()(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/workspace/train.py”, line 257, in forward
outputs = self.language_model(
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py”, line 1011, in forward
outputs = self.model.decoder(
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py”, line 777, in forward
layer_outputs = decoder_layer(
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py”, line 418, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File “/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py”, line 1501, in _call_impl
return forward_call(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/transformers/models/opt/modeling_opt.py”, line 192, in forward
attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/workspace/train.py", line 537, in <module>
torch.cuda.empty_cache() # Free memory if an error occurs
File “/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py”, line 133, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of ‘c10::Error’
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at …/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7d5e052a34d7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7d5e0526d36b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7d5e0533fb58 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c36b (0x7d5e0531036b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7d5e0531f930 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d5a16 (0x7d5dfe978a16 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7d5e05288e77 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7d5e0528169e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7d5e052817b9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: + 0x75afc8 (0x7d5dfebfdfc8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #10: + 0x12f715 (0x6094db5da715 in /usr/bin/python)
frame #11: + 0x268a60 (0x6094db713a60 in /usr/bin/python)
frame #12: Py_FinalizeEx + 0x148 (0x6094db70fb98 in /usr/bin/python)
frame #13: Py_RunMain + 0x173 (0x6094db7012d3 in /usr/bin/python)
frame #14: Py_BytesMain + 0x2d (0x6094db6d7cad in /usr/bin/python)
frame #15: + 0x29d90 (0x7d5e2fd71d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: __libc_start_main + 0x80 (0x7d5e2fd71e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #17: _start + 0x25 (0x6094db6d7ba5 in /usr/bin/python)

[2024-09-08 10:40:37,800] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2122
[2024-09-08 10:40:37,801] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=0', '--deepspeed_config', 'ds_config.json'] exits with return code = -6

To train the model I used the following DeepSpeed configuration (I reduced the initial loss scale to 1024):

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-3,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 1e-2
    }
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 1024
  }
}
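For completeness, the config file is wired into train.py roughly like this (a hedged sketch; the actual argument parsing in train.py may differ):

import argparse
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
parser = deepspeed.add_config_arguments(parser)   # adds --deepspeed / --deepspeed_config
args = parser.parse_args()

# combined_model is the PEFT-wrapped BLIP-2 model built earlier in the script
combined_model, optimizer, _, _ = deepspeed.initialize(
    args=args,                # picks up ds_config.json via --deepspeed_config
    model=combined_model,
    model_parameters=[p for p in combined_model.parameters() if p.requires_grad],
)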

Could you try to isolate the failing kernel by creating a CUDA coredump or by running your workload via cuda-gdb? If you are using an older PyTorch release, could you also update to the latest one before rerunning it?
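For example, one way to request a GPU coredump is to set the driver's coredump environment variables before any CUDA work happens (a minimal sketch; the output path is just an example, and the same variables can be exported in the shell that launches deepspeed instead):

# At the very top of train.py, before importing torch / touching CUDA:
import os
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"         # dump GPU state when a kernel faults
os.environ["CUDA_COREDUMP_FILE"] = "/workspace/gpu_coredump"  # example output location
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"                      # synchronous launches -> accurate stacktrace

import torch  # import only after the variables are set

The resulting file can then be inspected with cuda-gdb (target cudacore /workspace/gpu_coredump) to see which kernel performed the illegal access.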

I am working on RunPod. Pod info: 1 x RTX A6000, 16 vCPU, 62 GB RAM. Batch size: 6. I will try to change the PyTorch version and rerun it. Can I create a CUDA coredump inside RunPod? Thanks ptrblck.

I trained as you advised after updating PyTorch, but I am still getting an error. I could train only one epoch; after that I got the following error.

Error at step 1 in epoch 2: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/train.py", line 519, in <module>
[rank0]: outputs = combined_model(**batch)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py”, line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/deepspeed/runtime/engine.py”, line 1899, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/peft/peft_model.py”, line 762, in forward
[rank0]: return self.get_base_model()(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/train.py”, line 257, in forward
[rank0]: outputs = self.language_model(
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py”, line 1011, in forward
[rank0]: outputs = self.model.decoder(
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py”, line 777, in forward
[rank0]: layer_outputs = decoder_layer(
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py”, line 418, in forward
[rank0]: hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py”, line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py”, line 192, in forward
[rank0]: attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/train.py", line 538, in <module>
[rank0]: torch.cuda.empty_cache() # Free memory if an error occurs
[rank0]: File “/workspace/myenv/lib/python3.10/site-packages/torch/cuda/memory.py”, line 170, in empty_cache
[rank0]: torch._C._cuda_emptyCache()
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2024-09-12 16:32:23,067] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1475
[2024-09-12 16:32:23,067] [ERROR] [launch.py:325:sigkill_handler] ['/workspace/myenv/bin/python3', '-u', 'train.py', '--local_rank=0', '--deepspeed_config', 'ds_config.json'] exits with return code = 1