RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`


I had an error when running a script using torch:

23-03-10 17:24:37.413 - INFO: Model [DDPM] is created.
23-03-10 17:24:37.413 - INFO: Initial Model Finished
Traceback (most recent call last):
File "", line 69, in
File "/data/work/Diffusion/DDM/model/", line 53, in optimize_parameters
score, loss = self.netG(, self.loss_lambda)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/work/Diffusion/DDM/model/ddpm_modules/", line 253, in forward
return self.p_losses(x, loss_lambda, *args, **kwargs)
File "/data/work/Diffusion/DDM/model/ddpm_modules/", line 238, in p_losses
code = self.denoise_fn([x_in['S'], x_in['T'], x_t], dim=1), t)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/work/Diffusion/DDM/model/ddpm_modules/", line 225, in forward
x = layer(x, t)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/work/Diffusion/DDM/model/ddpm_modules/", line 136, in forward
x = self.attn(x)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data/work/Diffusion/DDM/model/ddpm_modules/", line 117, in forward
context = torch.einsum('bhdn,bhen->bhde', k, v)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/", line 378, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

It looks like the problem comes from the call to `torch.einsum`?
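One way to check whether `einsum` itself is at fault: this particular contraction is a batched matrix product, which is why it ends up in `cublasSgemmStridedBatched` on the GPU. Below is a minimal sketch that compares the einsum against the equivalent `torch.matmul` call. The shapes are made up for illustration (the real shapes of `k` and `v` are not shown in the traceback), and the code falls back to the CPU when no GPU is available.

```python
import torch

# Hypothetical shapes: batch=2, heads=4, dim=8, sequence length=16.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
k = torch.randn(2, 4, 8, 16, device=device)
v = torch.randn(2, 4, 8, 16, device=device)

# The contraction from the traceback: sum over n for every (b, h) pair.
context_einsum = torch.einsum("bhdn,bhen->bhde", k, v)

# The same contraction written as a batched matrix product.
context_matmul = torch.matmul(k, v.transpose(-1, -2))

print(context_einsum.shape)  # torch.Size([2, 4, 8, 8])
print(torch.allclose(context_einsum, context_matmul, atol=1e-4))
```

If both calls fail on the GPU with the same cuBLAS error, the problem is more likely in the CUDA/cuBLAS setup than in `einsum` specifically.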

To test my environment, I ran a test script that was posted earlier and got a similar error:

Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
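Type "help", "copyright", "credits" or "license" for more information.

import torch
import torch.nn as nn
# Assumed: the original post did not show how `device` was defined.
device = torch.device("cuda")
print(torch.cuda.get_device_properties(device))
# Output reported in the original post:
# _CudaDeviceProperties(name='NVIDIA RTX A6000', major=8, minor=6, total_memory=48676MB, multi_processor_count=84)
rr = torch.zeros([2, 20, 5000]).to(device)
layer1 = nn.Conv1d(20, 500, kernel_size=4, stride=4, groups=20, bias=False).to(device)
layer2 = nn.Linear(500, 768).to(device)
l1out = layer1(rr)                      # shape [2, 500, 1250]
l2out = layer2(l1out.transpose(1, 2))   # fails here with the cuBLAS error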
Traceback (most recent call last):
File "", line 1, in
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

Here is the output of `python -m torch.utils.collect_env`:
Collecting environment information…
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-144-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.5.119
GPU models and configuration:

Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] denoising-diffusion-pytorch==1.2.2
[pip3] ema-pytorch==0.2.1
[pip3] lion-pytorch==0.0.5
[pip3] numpy==1.23.1
[pip3] torch==1.13.1
[pip3] torchvision==0.14.1
[conda] Could not collect

Many thanks for any input on the error when calling `cublasSgemmStridedBatched`.

J. L.

Could you post a minimal and executable code snippet to reproduce the issue and wrap it into three backticks ```, please?

Thanks @ptrblck for your prompt response. Here is the snippet I used to reproduce the error, although it does not do exactly the same thing as my original script.

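import torch
import torch.nn as nn
# Assumed: the original post did not show how `device` was defined.
device = torch.device("cuda")
print(torch.cuda.get_device_properties(device))
# Output reported in the original post:
# _CudaDeviceProperties(name='NVIDIA RTX A6000', major=8, minor=6, total_memory=48676MB, multi_processor_count=84)
rr = torch.zeros([2, 20, 5000]).to(device)
layer1 = nn.Conv1d(20, 500, kernel_size=4, stride=4, groups=20, bias=False).to(device)
layer2 = nn.Linear(500, 768).to(device)
l1out = layer1(rr)                      # shape [2, 500, 1250]
l2out = layer2(l1out.transpose(1, 2))   # fails here with the cuBLAS error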

Traceback (most recent call last):
File "", line 1, in
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lomahu/.local/lib/python3.8/site-packages/torch/nn/modules/", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

I strongly suspect that the problem is related to my versions of torch or tensorflow, or to the RTX A6000 GPU cards. Your help is much appreciated!
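The environment report above lists a PyTorch build for CUDA 11.7 alongside a system CUDA runtime of 11.5.119. This thread does not establish whether that mismatch causes the error, but it is one thing worth ruling out, for example if `LD_LIBRARY_PATH` makes a system cuBLAS library take precedence over the one shipped with the pip wheel. Below is a small sketch that gathers the relevant details in one place. The helper name `describe_cuda_setup` is made up for illustration.

```python
import os
import torch

def describe_cuda_setup():
    """Collect the version details most relevant to a possible cuBLAS mismatch."""
    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info

for key, value in describe_cuda_setup().items():
    print(f"{key}: {value}")
```

If `ld_library_path` points at a different CUDA toolkit than the one PyTorch was built with, rerunning the reproduction with that variable unset is a cheap experiment.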


Any solutions so far?
I am also running into the same issue, especially when I use einsum.

Hi, I'm having the same error.


PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-172-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.6.55
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3090
GPU 5: NVIDIA GeForce RTX 3090
GPU 6: NVIDIA GeForce RTX 3090
GPU 7: NVIDIA GeForce RTX 3090

Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchvision==0.13.1
[conda] blas                      1.0                         mkl    conda-forge
[conda] cudatoolkit               11.6.2              hfc3e2af_13    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344
[conda] numpy                     1.26.4          py310hb13e2d6_0    conda-forge
[conda] pytorch                   1.12.1          py3.10_cuda11.6_cudnn8.3.2_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.12.1              py310_cu116    pytorch
[conda] torchvision               0.13.1              py310_cu116    pytorch


nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
conda 24.1.1

Here is some simple code to reproduce the error:

>>> import torch
>>> t1 = torch.randn(4,12,1024,64).cuda()
>>> t2 = torch.randn(4,12,1024,64).cuda()
>>> t = torch.matmul(t1, t2.transpose(-1,-2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
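To narrow this down, it can help to run the same batched matmul on the CPU and on the GPU and report what happens on each. Below is a minimal sketch. The helper name `try_batched_matmul` is made up, and the shapes are smaller than in the reproduction above to keep it quick (the original `(4, 12, 1024, 64)` can be substituted).

```python
import torch

def try_batched_matmul(device):
    """Run a batched matmul on `device` and report the outcome."""
    t1 = torch.randn(4, 12, 64, 32)
    t2 = torch.randn(4, 12, 64, 32)
    reference = torch.matmul(t1, t2.transpose(-1, -2))  # CPU result
    try:
        result = torch.matmul(t1.to(device), t2.to(device).transpose(-1, -2))
    except RuntimeError as err:
        return f"failed on {device}: {err}"
    max_diff = (result.cpu() - reference).abs().max().item()
    return f"ok on {device}, max abs difference vs CPU: {max_diff:.2e}"

# Only try the GPU when one is visible to PyTorch.
devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for name in devices:
    print(try_batched_matmul(torch.device(name)))
```

A failure on `cuda` for such a plain matmul, with the CPU run succeeding, would point at the CUDA/cuBLAS installation rather than at the model code.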