RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm` on RTX A5000

The same code runs on an RTX 1080 but fails on an RTX A5000 with the error: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)


I am using the following ptr.yml:

name: ptr
channels:
  - nvidia/label/cuda-11.5.1
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.01.10=h06a4308_0
  - certifi=2022.12.7=py37h06a4308_0
  - cuda=11.5.1=0
  - cuda-cccl=11.5.62=0
  - cuda-command-line-tools=11.5.1=0
  - cuda-compiler=11.5.1=0
  - cuda-cudart=11.5.117=h7e867a7_0
  - cuda-cudart-dev=11.5.117=h91b5d7a_0
  - cuda-cuobjdump=11.5.119=h32764b9_0
  - cuda-cupti=11.5.114=h2757d8a_0
  - cuda-cuxxfilt=11.5.119=h3f39129_0
  - cuda-driver-dev=11.5.117=0
  - cuda-gdb=11.5.114=h8765814_0
  - cuda-libraries=11.5.1=0
  - cuda-libraries-dev=11.5.1=0
  - cuda-memcheck=11.5.114=h765d031_0
  - cuda-nsight=11.5.114=0
  - cuda-nsight-compute=11.5.1=0
  - cuda-nvcc=11.5.119=h2e31d95_0
  - cuda-nvdisasm=11.5.119=he465173_0
  - cuda-nvml-dev=11.5.50=h511b398_0
  - cuda-nvprof=11.5.114=hd1b9a7f_0
  - cuda-nvprune=11.5.119=ha53ebc3_0
  - cuda-nvrtc=11.5.119=h411d788_0
  - cuda-nvrtc-dev=11.5.119=h3fe8e16_0
  - cuda-nvtx=11.5.114=ha1eacfd_0
  - cuda-nvvp=11.5.114=h233c720_0
  - cuda-runtime=11.5.1=0
  - cuda-samples=11.5.56=hf1e648b_0
  - cuda-sanitizer-api=11.5.114=h781f4d3_0
  - cuda-toolkit=11.5.1=0
  - cuda-tools=11.5.1=0
  - cuda-visual-tools=11.5.1=0
  - gds-tools=1.1.1.25=0
  - ld_impl_linux-64=2.38=h1181459_1
  - libcublas=11.7.4.6=hd52c9d2_0
  - libcublas-dev=11.7.4.6=h9ea41a3_0
  - libcufft=10.6.0.107=hd5a0538_0
  - libcufft-dev=10.6.0.107=hb86e5fa_0
  - libcufile=1.1.1.25=0
  - libcufile-dev=1.1.1.25=0
  - libcurand=10.2.7.107=h449470a_0
  - libcurand-dev=10.2.7.107=hd5b7b69_0
  - libcusolver=11.3.2.107=hc875929_0
  - libcusolver-dev=11.3.2.107=h78cb71c_0
  - libcusparse=11.7.0.107=hf21abff_0
  - libcusparse-dev=11.7.0.107=h338262b_0
  - libffi=3.4.2=h6a678d5_6
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libnpp=11.5.1.107=hb6e5806_0
  - libnpp-dev=11.5.1.107=he6b01ee_0
  - libnvjpeg=11.5.4.107=h31e24ca_0
  - libnvjpeg-dev=11.5.4.107=h6455901_0
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - nsight-compute=2021.3.1.4=0
  - openssl=1.1.1t=h7f8727e_0
  - pip=22.3.1=py37h06a4308_0
  - python=3.7.16=h7a1cb2a_0
  - readline=8.2=h5eee18b_0
  - setuptools=65.6.3=py37h06a4308_0
  - sqlite=3.40.1=h5082296_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.38.4=py37h06a4308_0
  - xz=5.2.10=h5eee18b_1
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - charset-normalizer==3.0.1
    - click==8.1.3
    - filelock==3.9.0
    - idna==3.4
    - importlib-metadata==6.0.0
    - joblib==1.2.0
    - numpy==1.18.0
    - packaging==23.0
    - regex==2022.10.31
    - requests==2.28.2
    - sacremoses==0.0.53
    - scikit-learn==0.22.1
    - scipy==1.4.1
    - six==1.16.0
    - tokenizers==0.9.4
    - torch==1.4.0
    - tqdm==4.41.1
    - transformers==4.0.0
    - typing-extensions==4.5.0
    - urllib3==1.26.14
    - zipp==3.14.0

on the A5000 machine:

+-----------------------------------------------------------------------------+                                                                   
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |                                                                   
|-------------------------------+----------------------+----------------------+                                                                   
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |                                                                   
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |                                                                   
|                               |                      |               MIG M. |                                                                   
|===============================+======================+======================|                                                                   
|   0  NVIDIA RTX A5000    Off  | 00000000:01:00.0 Off |                  Off |                                                                   
| 30%   27C    P0    58W / 230W |      0MiB / 24564MiB |      0%      Default |                                                                   
|                               |                      |                  N/A |                                                                   
+-------------------------------+----------------------+----------------------+                                                                   
|   1  NVIDIA RTX A5000    Off  | 00000000:41:00.0 Off |                  Off |                                                                   
| 30%   26C    P0    62W / 230W |      0MiB / 24564MiB |      0%      Default |

Specifically I am using torch 1.4.0 and cuda-toolkit 11.5.1.

Does anyone know how to fix this?

Your PyTorch version is quite old so could you update to the latest stable (1.13.1) or nightly release, please?
Let me know if you are still seeing the same issue.
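One way to check whether an installed PyTorch binary ships kernels for a given GPU is to compare the device's compute capability with the architectures the binary was built for. The sketch below is a rough illustration, not an official check: `binary_supports_device` is a hypothetical helper, and it assumes the usual CUDA compatibility rules (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any equal or newer capability). The RTX A5000 has compute capability 8.6. `torch.cuda.get_arch_list()` may be missing in very old releases, so the script looks it up defensively.

```python
def binary_supports_device(arch_list, capability):
    """Return True if any architecture in arch_list can run on the device.

    arch_list  -- strings such as "sm_75" or "compute_80", in the format
                  returned by torch.cuda.get_arch_list()
    capability -- (major, minor) tuple, as returned by
                  torch.cuda.get_device_capability()
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit() or len(digits) < 2:
            continue  # skip entries this sketch does not understand
        major, minor = int(digits[:-1]), int(digits[-1])
        # Precompiled kernels: same major version, minor not newer than device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # PTX: can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        # get_arch_list may not exist in old releases, hence the getattr.
        get_archs = getattr(torch.cuda, "get_arch_list", None)
        archs = get_archs() if get_archs is not None else []
        print("torch", torch.__version__, "built with CUDA", torch.version.cuda)
        print("device capability:", cap, "binary archs:", archs)
        print("supported:", binary_supports_device(archs, cap))
```

If the last line prints `False`, the installed wheel has no kernels the GPU can run, and a newer PyTorch build is needed regardless of which CUDA toolkit is installed in the conda environment.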

Thank you so much for the support!
The code is old (https://github.com/thunlp/PTR). When I switched to torch 1.13.0, I found that many library components have been moved. Which torch versions support the RTX A5000?

(ptr) yerong2@timan108:~/PTR$ bash scripts/run_large_tacred.sh 
Traceback (most recent call last):
  File "src/run_prompt.py", line 1, in <module>
    from arguments import get_args_parser
  File "/shared/home/yerong/PTR/src/arguments.py", line 3, in <module>
    import transformers
  File "/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/transformers/__init__.py", line 626, in <module>
    from .trainer import Trainer
  File "/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/transformers/trainer.py", line 69, in <module>
    from .trainer_pt_utils import (
  File "/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/transformers/trainer_pt_utils.py", line 40, in <module>
    from torch.optim.lr_scheduler import SAVE_STATE_WARNING
ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/torch/optim/lr_scheduler.py)

The current PyTorch release supports your A5000, and the error is raised by Huggingface/Transformers here.
I don’t know why the ImportError is not caught as seen in the linked code snippet, but you might need to update transformers to the latest version to avoid these errors.
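For reference, a guarded import of the kind mentioned above might look like the following sketch. It assumes, as the traceback suggests, that `SAVE_STATE_WARNING` is no longer exported by `torch.optim.lr_scheduler` in newer PyTorch releases. Falling back to an empty string is an assumption about what the calling code needs, not a statement of what every transformers version does.

```python
# Try the old location first; newer PyTorch releases may not export this name.
try:
    from torch.optim.lr_scheduler import SAVE_STATE_WARNING
except ImportError:
    # Fallback (assumption): the constant is only used to filter a warning
    # message, so an empty string is a harmless placeholder.
    SAVE_STATE_WARNING = ""

print(repr(SAVE_STATE_WARNING))
```

Upgrading transformers, as suggested, avoids having to patch the installed package by hand.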

Updating torch to 1.13.0 and editing the transformers package did not resolve the issue.

    return forward_call(*input, **kwargs)
  File "/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 283, in forward
    output_attentions,
  File "/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/shared/home/yerong/local/Conda/envs/ptr/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 200, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

Here is my environment:

name: ptr
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - ca-certificates=2023.01.10=h06a4308_0
  - certifi=2022.12.7=py37h06a4308_0
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.4.2=h6a678d5_6
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - ncurses=6.4=h6a678d5_0
  - openssl=1.1.1t=h7f8727e_0
  - pip=22.3.1=py37h06a4308_0
  - python=3.7.16=h7a1cb2a_0
  - readline=8.2=h5eee18b_0
  - setuptools=65.6.3=py37h06a4308_0
  - sqlite=3.40.1=h5082296_0
  - tk=8.6.12=h1ccaba5_0
  - wheel=0.38.4=py37h06a4308_0
  - xz=5.2.10=h5eee18b_1
  - zlib=1.2.13=h5eee18b_0
  - pip:
    - charset-normalizer==3.0.1
    - click==8.1.3
    - filelock==3.9.0
    - idna==3.4
    - importlib-metadata==6.0.0
    - joblib==1.2.0
    - numpy==1.18.0
    - nvidia-cublas-cu11==11.10.3.66
    - nvidia-cuda-nvrtc-cu11==11.7.99
    - nvidia-cuda-runtime-cu11==11.7.99
    - nvidia-cudnn-cu11==8.5.0.96
    - packaging==23.0
    - regex==2022.10.31
    - requests==2.28.2
    - sacremoses==0.0.53
    - scikit-learn==0.22.1
    - scipy==1.4.1
    - six==1.16.0
    - tokenizers==0.9.4
    - torch==1.13.0
    - tqdm==4.41.1
    - transformers==4.0.0
    - typing-extensions==4.5.0
    - urllib3==1.26.14
    - zipp==3.15.0
prefix: /shared/home/yerong/local/Conda/envs/ptr

Could you post a minimal and executable code snippet to reproduce the issue, please?
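A starting point for such a snippet could isolate the `torch.matmul` call from the traceback. The sketch below is only a guess at the setup: the function name `attention_scores_repro` and the shapes (16 heads with head size 64, as in a RoBERTa-large-style attention layer) are assumptions, not values taken from the PTR code. Setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized makes kernel launches synchronous, so the stack trace points at the call that actually failed.

```python
import os

# Must be set before the first CUDA call so errors surface at the failing op.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

try:
    import torch
except ImportError:  # lets the script exit cleanly where torch is absent
    torch = None


def attention_scores_repro(batch=2, heads=16, seq_len=128, head_dim=64):
    """Run the batched matmul from the traceback on random float32 inputs.

    Returns the shape of the result, or None if torch is not installed.
    All default sizes are illustrative assumptions.
    """
    if torch is None:
        return None
    device = "cuda" if torch.cuda.is_available() else "cpu"
    query = torch.randn(batch, heads, seq_len, head_dim, device=device)
    key = torch.randn(batch, heads, seq_len, head_dim, device=device)
    scores = torch.matmul(query, key.transpose(-1, -2))
    if device == "cuda":
        torch.cuda.synchronize()  # force any pending CUDA error to surface here
    return tuple(scores.shape)


if __name__ == "__main__":
    print(attention_scores_repro())
```

If this runs cleanly on the A5000, the failure probably depends on the real input shapes or on an earlier operation, which narrows down where to look next.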

It is too complicated to reduce this to a snippet. The most straightforward way to reproduce it is to run:

git clone https://github.com/thunlp/PTR.git
cd PTR
pip install -r requirements.txt
pip install torch==1.13.0
bash data/download.sh all
bash scripts/run_large_tacred.sh

This code runs on the RTX 1080, where a memory issue comes up later in the run, while on the A5000 the CUBLAS_STATUS_NOT_SUPPORTED error shows up on the first forward pass. Could you help me with this?

I see there are many similar CUBLAS_STATUS_NOT_SUPPORTED reports on this forum…

No, sorry, I won’t be able to download your (binary) dataset.
Would it be possible to get the model definition with input shapes in case this would reproduce the issue?
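One low-effort way to collect the input shapes without rewriting the repository is to log them from a forward pre-hook. The sketch below is an assumption about how that could be wired up: `describe_shapes` and `log_input_shapes` are hypothetical helpers, while `register_forward_pre_hook` is the standard `torch.nn.Module` method. The helpers are plain Python, so they can be tried without a GPU.

```python
def describe_shapes(obj):
    """Recursively replace anything with a .shape attribute by its shape tuple.

    Other leaf values are replaced by their type name so the output stays short.
    """
    if hasattr(obj, "shape"):
        return tuple(obj.shape)
    if isinstance(obj, tuple):
        return tuple(describe_shapes(item) for item in obj)
    if isinstance(obj, list):
        return [describe_shapes(item) for item in obj]
    if isinstance(obj, dict):
        return {key: describe_shapes(value) for key, value in obj.items()}
    return type(obj).__name__


def log_input_shapes(module, inputs):
    """Forward pre-hook: print the module class and the shapes of its inputs."""
    print(type(module).__name__, describe_shapes(inputs))


# Usage (assumes `model` is the torch.nn.Module built by the training script):
# handles = [m.register_forward_pre_hook(log_input_shapes) for m in model.modules()]
# ... run one forward pass, then remove the hooks:
# for handle in handles:
#     handle.remove()
```

The last shapes printed before the crash are the ones worth pasting into a standalone script.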

Producing a minimal snippet is definitely the hardest part for almost anyone, because these repos are really big.

In one step of the forward pass, memory usage blows up with torch==1.13.0.
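To put a number on that, the peak allocation around a single step can be measured with PyTorch's memory statistics. This is a minimal sketch under assumptions: `peak_cuda_memory_mib` and the `step_fn` callable are hypothetical names, and the figure covers only memory allocated through PyTorch's caching allocator on the current device.

```python
try:
    import torch
except ImportError:  # keeps the helper importable where torch is absent
    torch = None


def peak_cuda_memory_mib(step_fn):
    """Call step_fn() once and return the peak CUDA memory allocated, in MiB.

    Returns None when torch or a CUDA device is unavailable; step_fn is
    still called in that case.
    """
    if torch is None or not torch.cuda.is_available():
        step_fn()
        return None
    torch.cuda.reset_peak_memory_stats()  # start from a clean peak counter
    step_fn()
    torch.cuda.synchronize()  # wait for queued kernels before reading stats
    return torch.cuda.max_memory_allocated() / 2**20


# Usage (assumes `model` and `batch` come from the training script):
# print(peak_cuda_memory_mib(lambda: model(**batch)))
```

Comparing the value between the two PyTorch versions on the same batch would show whether the growth comes from the forward pass itself.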

Here is the dataset; you can unzip it into the data folder before you run the bash command:
https://drive.google.com/drive/folders/1ilmxc8wjniC-GB31zlz1dbaWFYqg0P7C?usp=share_link

I found an example with a minimal script, based on an official PyTorch example, here:

https://discuss.pytorch.org/t/trainer-train-stuck-with-rtx-a6000/175093

Maybe we can continue the discussion in that post instead.