CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when using RoBERTa

Hi everyone,

I’m training a model that uses RoBERTa to encode sentences, and I’m getting the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

My problem is that the first epoch of training and the subsequent evaluation complete fine, but the error is raised during the second epoch. I don’t really know what happened, and after a while of searching I still couldn’t find an answer.
Note: I’m using an NVIDIA GTX 1080 Ti (11 GB VRAM), and I’ve checked that no OOM occurs.
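For reference, this is roughly the kind of check I ran between epochs, just a minimal sketch using torch’s built-in memory counters (nothing specific to my script):

import torch

# Log allocated/reserved memory against the device total after each epoch
allocated = torch.cuda.memory_allocated() / 1024**3
reserved = torch.cuda.memory_reserved() / 1024**3
total = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"allocated {allocated:.2f} GiB / reserved {reserved:.2f} GiB / total {total:.2f} GiB")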

Here is my environment:
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.26.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch-geometric==1.7.0
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.17
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.12.0+cu113
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] torch 1.8.0+cu111 pypi_0 pypi
[conda] torch-geometric 1.7.0 pypi_0 pypi
[conda] torch-scatter 2.0.7 pypi_0 pypi
[conda] torch-sparse 0.6.9 pypi_0 pypi

Here is the full error message:
Traceback (most recent call last):
  File "logignn.py", line 446, in <module>
    main()
  File "logignn.py", line 99, in main
    train(args)
  File "logignn.py", line 262, in train
    logits, _ = model(*[x[a:b] for x in input_data], layer_id=args.encoder_layer)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/ptkhai/logiqa/modeling/modeling_logignn.py", line 230, in forward
    sent_vecs, all_hidden_states = self.encoder(*lm_inputs, layer_id=layer_id)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/ptkhai/logiqa/modeling/modeling_encoder.py", line 43, in forward
    outputs = self.module(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 677, in forward
    encoder_outputs = self.encoder(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 418, in forward
    layer_outputs = layer_module(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 339, in forward
    self_attention_outputs = self.attention(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 271, in forward
    self_outputs = self.self(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 194, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

Could you update PyTorch to the latest stable or nightly release and check if this error still exists?
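If you do try a newer release, a quick sanity check that the expected binaries are actually picked up could look like this (nothing project-specific assumed):

import torch

print(torch.__version__)                 # installed PyTorch version
print(torch.version.cuda)                # CUDA version the binary was built with
print(torch.backends.cudnn.version())    # cuDNN version shipped with the binary
print(torch.cuda.get_device_name(0))     # should report the GTX 1080 Ti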

Thank you for your reply.
I have to use this PyTorch version because the code requires PyTorch 1.8.0 with CUDA 11.1.
Is there any other way for me to resolve this problem?

Best regards

It’s unclear whether cuBLAS fails directly or is the victim of another failing operation that left the CUDA context in a sticky error state.
You could rerun the code with CUDA_LAUNCH_BLOCKING=1 (see the sketch below) to check whether the error stays the same and which line of code actually fails.
Note that older PyTorch releases and CUDA versions won’t receive fixes, so if you still encounter the error there you would need to update.
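For example, a minimal sketch (the script name is just a placeholder for your own entry point); the variable has to be set before the first CUDA call:

import os

# Equivalent to running: CUDA_LAUNCH_BLOCKING=1 python logignn.py ...
# Must be set before the first CUDA call so kernels launch synchronously
# and the failing line shows up directly in the Python traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # the rest of the training script follows after this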

Thanks for your help.

I will try it.