Hi everyone,
I’m training a model that uses RoBERTa to encode sentences, and I got the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
The first epoch of training and the evaluation that follows it both finish without problems, but I get this error during the second epoch. I don’t know what happened, and after searching for a while I still couldn’t find an answer.
Note: I’m using an NVIDIA GTX 1080 Ti (11 GB VRAM), and I have checked that no out-of-memory (OOM) error occurred.
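Because CUDA kernels are launched asynchronously, the Python line shown in a CUDA traceback is not necessarily the operation that actually failed. A minimal sketch of a debugging run with synchronous launches is below. It assumes the variable is set before CUDA is initialized, which is safest to do at the very top of the training script, before importing torch. The helper name `cuda_launch_blocking_enabled` is only illustrative.

```python
import os

# Must be set before the first CUDA call (safest: before importing torch);
# once the CUDA runtime is initialized, changing it has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Return True if synchronous kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print("CUDA_LAUNCH_BLOCKING enabled:", cuda_launch_blocking_enabled())
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python logignn.py`. Training runs slower in this mode, so it is meant for a single diagnostic run only.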
Here is my environment:
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 520.61.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch-geometric==1.7.0
[pip3] torch-scatter==2.0.9
[pip3] torch-sparse==0.6.17
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.12.0+cu113
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] torch 1.8.0+cu111 pypi_0 pypi
[conda] torch-geometric 1.7.0 pypi_0 pypi
[conda] torch-scatter 2.0.7 pypi_0 pypi
[conda] torch-sparse 0.6.9 pypi_0 pypi
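The listing above mixes packages built against different CUDA versions (for example `torch 1.8.0+cu111` next to `torchvision 0.12.0+cu113`). Whether this is related to the error is not established, but it is easy to check mechanically. The sketch below compares the `+cuXXX` suffixes of the version strings; the function names `cuda_tag` and `mismatched_cuda_tags` are made up for illustration, and the version strings are copied from the listing.

```python
import re
from typing import Dict, Optional

# Version strings copied from the environment listing above.
VERSIONS: Dict[str, str] = {
    "torch": "1.8.0+cu111",
    "torchvision": "0.12.0+cu113",
    "torchaudio": "0.13.0",
}


def cuda_tag(version: str) -> Optional[str]:
    """Extract the CUDA build tag (e.g. '111') from a '+cuXXX' suffix, if any."""
    match = re.search(r"\+cu(\d+)$", version)
    return match.group(1) if match else None


def mismatched_cuda_tags(versions: Dict[str, str], reference: str = "torch") -> Dict[str, str]:
    """Return packages whose CUDA tag differs from the reference package's tag.

    Packages without a '+cuXXX' suffix are skipped because their build
    cannot be determined from the version string alone.
    """
    ref_tag = cuda_tag(versions[reference])
    mismatches = {}
    for name, version in versions.items():
        tag = cuda_tag(version)
        if name != reference and tag is not None and tag != ref_tag:
            mismatches[name] = tag
    return mismatches


print(mismatched_cuda_tags(VERSIONS))  # {'torchvision': '113'}
```

This only inspects version strings, so it cannot tell which CUDA libraries are actually loaded at runtime.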
Here is the full error message:
Traceback (most recent call last):
  File "logignn.py", line 446, in <module>
    main()
  File "logignn.py", line 99, in main
    train(args)
  File "logignn.py", line 262, in train
    logits, _ = model(*[x[a:b] for x in input_data], layer_id=args.encoder_layer)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/ptkhai/logiqa/modeling/modeling_logignn.py", line 230, in forward
    sent_vecs, all_hidden_states = self.encoder(*lm_inputs, layer_id=layer_id)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/ptkhai/logiqa/modeling/modeling_encoder.py", line 43, in forward
    outputs = self.module(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 677, in forward
    encoder_outputs = self.encoder(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 418, in forward
    layer_outputs = layer_module(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 339, in forward
    self_attention_outputs = self.attention(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 271, in forward
    self_outputs = self.self(
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ndthuc/anaconda3/envs/logignn/lib/python3.8/site-packages/transformers/modeling_roberta.py", line 194, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
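One possibility worth ruling out, though nothing in the traceback confirms it, is that a batch in the second epoch contains a token ID outside the embedding table. An invalid index in an earlier kernel can surface later as a cuBLAS failure in an unrelated matmul. The sketch below scans a batch of token IDs on the CPU. It works on plain Python lists (a tensor would first be converted with `.tolist()`), the function name `find_out_of_range_ids` is hypothetical, and the batch and the vocabulary size of 50265 are illustrative values; the real size should be read from the model configuration.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_ids(
    input_ids: Sequence[Sequence[int]], vocab_size: int
) -> List[Tuple[int, int, int]]:
    """Return (row, column, token_id) for every ID outside [0, vocab_size)."""
    bad = []
    for row, sequence in enumerate(input_ids):
        for col, token_id in enumerate(sequence):
            if not 0 <= token_id < vocab_size:
                bad.append((row, col, token_id))
    return bad


# Hypothetical batch and vocabulary size, for illustration only.
batch = [[0, 15, 27, 2], [0, 99, 50300, 2]]
print(find_out_of_range_ids(batch, vocab_size=50265))  # [(1, 2, 50300)]
```

Running a check like this on every batch before the forward pass, or running one epoch entirely on the CPU, would show whether the inputs are at fault; if nothing is reported, the cause lies elsewhere.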