CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

George_Guliman · April 1, 2021, 12:27pm

Hello,

I am trying to run a simple model using GPU acceleration. I am currently encountering 2 different issues with this.

1. Whatever CUDA/PyTorch combination I use, it always takes around 15 minutes to execute the first instruction on the GPU (no matter which instruction is executed).
2. For my model, I always get an error, although it runs fine on the CPU.

For 1), I am trying to install PyTorch from source to see if anything works differently.
For 2), I am completely stuck.

The model I am using is below:

import matplotlib.pyplot as plt
import torch
from torchvision import datasets, transforms
import helper
import numpy as np

data_dir = '…/Cat_Dog_data'

# TODO: Define transforms for the training data and testing data

train_transforms = transforms.Compose([transforms.RandomRotation(30),
                                       transforms.RandomResizedCrop(224),
                                       transforms.RandomHorizontalFlip(),
                                       transforms.Grayscale(),
                                       transforms.ToTensor()])

test_transforms = transforms.Compose([transforms.Resize(255),
                                      transforms.CenterCrop(224),
                                      transforms.Grayscale(),
                                      transforms.ToTensor()])

# Pass transforms in here, then run the next cell to see how the transforms look

train_data = datasets.ImageFolder(data_dir + '/train', transform=train_transforms)
test_data = datasets.ImageFolder(data_dir + '/test', transform=test_transforms)

trainloader = torch.utils.data.DataLoader(train_data, batch_size=32)
testloader = torch.utils.data.DataLoader(test_data, batch_size=32)
len(test_data)

from torch import nn, optim
import torch.nn.functional as F

if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"

print("device is: " + dev)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = torch.device(dev)

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc0 = nn.Linear(50176, 784)
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 2)

    def forward(self, x):
        # make sure input tensor is flattened
        print('The shape of X is: ')
        print(x.shape)
        x = x.view(x.shape[0], -1)
        print('The shape of X flattened is: ')
        print(x.shape)

        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.log_softmax(self.fc4(x), dim=1)

        print('The shape of X is: ')
        print(x.shape)

        return x

model= Classifier()
model = model.to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0003)

print("Our model: \n\n", model, '\n')
print("The state dict keys: \n\n", model.state_dict().keys())

epochs = 2
steps = 0

train_losses, test_losses, test_acc = [], [], []
# Training pass
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        images = images.to(device)
        print("Device for images is: ", images.get_device())
        labels = labels.to(device)
        optimizer.zero_grad()
        log_ps = model(images)
        loss = criterion(log_ps, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # Validation pass (for/else: runs once the training loop above finishes)
    else:
        with torch.no_grad():
            test_loss = 0
            running_accuracy = 0
            for images, labels in testloader:
                images = images.to(device)
                labels = labels.to(device)

                optimizer.zero_grad()

                print("Asta vrem, asta vreem: ", images.shape)  # Romanian: "This is what we want"
                log_psV = model(images)
                print(log_psV.shape, labels.shape)
                # .item() keeps test_loss a plain float so it can be plotted later
                test_loss += criterion(log_psV, labels).item()

                psV = torch.exp(log_psV)
                top_p, top_class = psV.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                accuracy = torch.mean(equals.type(torch.FloatTensor))
                running_accuracy += accuracy.item()

        train_losses.append(running_loss/len(trainloader))
        test_losses.append(test_loss/len(testloader))
        test_acc.append(running_accuracy/len(testloader))

        print(f'Epoch: {e+1} epochs')
        print(f'Training Loss: {running_loss/len(trainloader)}')
        print(f'Test Loss: {test_loss/len(testloader)}')
        print(f'Test Accuracy: {(running_accuracy/len(testloader))*100}%')
plt.plot(range(epochs), train_losses, label='Training Loss')
plt.plot(range(epochs), test_losses, label='Test Loss')
plt.plot(range(epochs), test_acc, label='Accuracy')
plt.legend()

The output and error I am getting is this:

Our model:

Classifier(
  (fc0): Linear(in_features=50176, out_features=784, bias=True)
  (fc1): Linear(in_features=784, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=64, bias=True)
  (fc4): Linear(in_features=64, out_features=2, bias=True)
)

The state dict keys:

odict_keys(['fc0.weight', 'fc0.bias', 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias', 'fc3.weight', 'fc3.bias', 'fc4.weight', 'fc4.bias'])
Device for images is: 0
The shape of X is:
torch.Size([32, 1, 224, 224])
The shape of X flattened is:
torch.Size([32, 50176])

RuntimeError Traceback (most recent call last)
in
63 optimizer.zero_grad()
64
---> 65 log_ps = model(images)
66 loss = criterion(log_ps, labels)
67 loss.backward()

~.conda\envs\pytorch18-cuda111\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),

in forward(self, x)
31 print(x.shape)
32
---> 33 x = F.relu(self.fc0(x))
34 x = F.relu(self.fc1(x))
35 x = F.relu(self.fc2(x))

~.conda\envs\pytorch18-cuda111\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),

~.conda\envs\pytorch18-cuda111\lib\site-packages\torch\nn\modules\linear.py in forward(self, input)
92
93 def forward(self, input: Tensor) -> Tensor:
---> 94 return F.linear(input, self.weight, self.bias)
95
96 def extra_repr(self) -> str:

~.conda\envs\pytorch18-cuda111\lib\site-packages\torch\nn\functional.py in linear(input, weight, bias)
1751 if has_torch_function_variadic(input, weight):
1752 return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1753 return torch._C._nn.linear(input, weight, bias)
1754
1755

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

Does anyone have any clue on why I am getting this behaviour?

Thank you in advance,

ptrblck · April 2, 2021, 6:00am

This sounds as if the JIT would kick in and compile the CUDA kernels for your architecture, which seem to be missing from the used binary.
How did you install/build PyTorch and which GPU are you using?
Once 1. is solved, this might also be fixed.
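
A rough way to test the "kernels missing from the binary" theory is to compare the GPU's compute capability with the architectures the installed build was compiled for. The sketch below is only an illustration: `parse_arch` and `classify_support` are hypothetical helper names, and the compatibility rules are simplified (an `sm_XY` binary is assumed to run on a GPU with the same major version and an equal or higher minor version, and `compute_XY` PTX is assumed to be JIT-compilable for any newer GPU, which is the slow path). The two PyTorch calls used at the end, `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`, are guarded so the block also runs on a machine without PyTorch or a GPU.

```python
def parse_arch(arch):
    """Split an entry such as 'sm_75' or 'compute_37' into (kind, major, minor)."""
    kind, _, number = arch.partition("_")
    digits = "".join(ch for ch in number if ch.isdigit())  # tolerate suffixes like '90a'
    value = int(digits)
    return kind, value // 10, value % 10


def classify_support(capability, arch_list):
    """Classify how a build with `arch_list` can serve a GPU of the given capability.

    Returns 'native', 'binary-compatible', 'ptx-jit', or 'unsupported'.
    Simplified rules, see the assumptions stated above.
    """
    parsed = [parse_arch(a) for a in arch_list]
    major, minor = capability
    if ("sm", major, minor) in parsed:
        return "native"
    if any(k == "sm" and ma == major and mi <= minor for k, ma, mi in parsed):
        return "binary-compatible"
    if any(k == "compute" and (ma, mi) <= (major, minor) for k, ma, mi in parsed):
        return "ptx-jit"  # kernels would be compiled at first use, which can be slow
    return "unsupported"


# Arch list reported later in this thread; an RTX 2070 has compute capability 7.5.
archs = ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'compute_37']
print(classify_support((7, 5), archs))  # native

try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    print(classify_support(torch.cuda.get_device_capability(0),
                           torch.cuda.get_arch_list()))
```

A result of `ptx-jit` would be consistent with a long delay on the first GPU call; `native` suggests the delay has another cause.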

George_Guliman · April 2, 2021, 2:42pm

Hello @ptrblck ,

I have an RTX 2070 GPU and I tried installing torch with CUDA via conda, following the instructions from https://pytorch.org/.
I tried both CUDA 10.2 and CUDA 11.1. I tried installing the conda packages and also installing the packages directly via pip inside the conda env.
I am now trying to install from source, but I am stuck there as well with this problem

ptrblck · April 2, 2021, 6:51pm

Which PyTorch version are you installing from conda, as I would like to reproduce this issue?
Also, what kind of model are you running?

George_Guliman · April 2, 2021, 7:10pm

I am using the following versions:
python 3.8.5 h5fd99cc_1
pytorch 1.8.1 py3.8_cuda11.1_cudnn8_0
torchaudio 0.8.1 py38
torchvision 0.9.1 py38_cu111

For the model, I am using a simple MLP network with 4 hidden layers. I posted the entire source code above.

The code just for the network is:

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc0 = nn.Linear(50176, 784)
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 2)

    def forward(self, x):
        # make sure input tensor is flattened
        print('The shape of X is: ')
        print(x.shape)
        x = x.view(x.shape[0], -1)
        print('The shape of X flattened is: ')
        print(x.shape)
        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.log_softmax(self.fc4(x), dim=1)

        print('The shape of X is: ')
        print(x.shape)

        return x

George_Guliman · April 10, 2021, 7:52am

Hello @ptrblck ,

So it seems that there really isn't any solution for me to use PyTorch with CUDA.

I am completely stuck and frustration is kicking in…

It seems that I am reaching the same dead end whether I install PyTorch from source or use the provided binaries. The output below is from the PyTorch nightly build with CUDA 10.2:

>>> import torch
>>> x = torch.randn(1024, 1024).cuda()
>>> y = torch.matmul(x, x)  # <-- this takes around 20 minutes to execute
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
>>> torch.__version__
'1.9.0.dev20210409'
>>> torch.version.cuda
'10.2'
>>> torch.cuda.get_arch_list()
['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'compute_37']
>>> torch.cuda.get_device_name()
'GeForce RTX 2070'

I got the same results with the variant of PyTorch installed from source. Details: https://discuss.pytorch.org/t/pytorch-cuda-11-2-build-from-source-runtimeerror-cuda-error-no-kernel-image-is-available-for-execution-on-the-device/116392/7
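
To put a number on the slow first call, one option is to time the same small matmul twice. This is a minimal sketch, not a benchmark: `timed` and `matmul_once` are hypothetical helper names, and the GPU part is guarded so the block still runs where PyTorch or CUDA is unavailable. `torch.cuda.synchronize()` is called before stopping the clock because CUDA kernels are launched asynchronously.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    def matmul_once():
        x = torch.randn(1024, 1024, device="cuda")
        y = torch.matmul(x, x)
        torch.cuda.synchronize()  # wait for the GPU so the timing is meaningful
        return y.shape

    # The first call includes CUDA context creation and any one-time kernel setup.
    _, first = timed(matmul_once)
    # The second call should be far faster if only the start-up cost was slow.
    _, second = timed(matmul_once)
    print(f"first call: {first:.2f}s, second call: {second:.4f}s")
```

If the second call is fast, the cost is a one-time start-up cost; if both calls are slow or fail, the problem is more likely in the driver or system setup.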

Any help, any tip is greatly appreciated.

Thank you,

ptrblck · April 10, 2021, 8:37am

I don’t know what might be causing this issue, as it seems no workflow (neither the binaries nor a source build) fixes it in your setup.
Unfortunately I’m unable to reproduce the issue using:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc0 = nn.Linear(50176, 784)
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 2)

    def forward(self, x):
        # make sure input tensor is flattened
        print('The shape of X is: ')
        print(x.shape)
        x = x.view(x.shape[0], -1)
        print('The shape of X flattened is: ')
        print(x.shape)

        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.log_softmax(self.fc4(x), dim=1)

        print('The shape of X is: ')
        print(x.shape)

        return x

device = 'cuda:0'
print(torch.cuda.get_device_name(device))

model = Classifier().to(device)
for i in [1, 8, 16, 32]:
    x = torch.randn(i, 50176, device=device)
    out = model(x)
    print(out.shape)

on an RTX 2080 Ti with the 1.8.1+CUDA10.2 binaries, so I guess your system setup might not be working properly, and I would recommend updating the NVIDIA drivers etc.

George_Guliman · April 12, 2021, 9:59pm

OK, I will try it on Linux. Hopefully it works there :))

Haris_Cheong · June 4, 2021, 6:19am

I'm having the same issue. I realized that the torch.nn.Linear layers are the problem: when I change the model to a fully convolutional network, it runs without the cuBLAS error.

System Specs:

NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 (installed with runfile so drivers come together)
Installed pytorch using this command: pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-73-generic x86_64)
GeForce GTX 1080 Ti

To reproduce:

import os
os.environ['CUDA_VISIBLE_DEVICES']='0'
import torch

class Classifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        #classifier
        self.linear_1 = torch.nn.Linear(32*8*4*8,32*8*4)
        self.linear_2 = torch.nn.Linear(32*8*4,32*8)
        self.batch_norm_1 = torch.nn.BatchNorm1d(32*8)

        self.linear_3 = torch.nn.Linear(32*8,32)
        self.linear_4 = torch.nn.Linear(32,1)

        self.activation = torch.nn.ReLU()


    def forward(self,x):
        x = x.reshape(x.shape[0], -1)
        x = self.linear_1(x)
        x = self.activation(x)
        x = self.linear_2(x)
        x = self.activation(x)
        x = self.batch_norm_1(x)

        x = self.linear_3(x)
        x = self.activation(x)
        x = self.linear_4(x)
        return x
    
net = Classifier().cuda()
inp = torch.rand(2,32,8,4,8).cuda()
gt = torch.rand(2,1).cuda()
outp = net(inp)
loss = torch.nn.BCEWithLogitsLoss()(outp, gt)
loss.backward()

Error observed:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-a9db9cb37cd0> in <module>
     35 outp = net(inp)
     36 loss = torch.nn.BCEWithLogitsLoss()(outp, gt)
---> 37 loss.backward()

/disk4/haris/envs/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    243                 create_graph=create_graph,
    244                 inputs=inputs)
--> 245         torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
    246 
    247     def register_hook(self, hook):

/disk4/haris/envs/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    143         retain_graph = create_graph
    144 
--> 145     Variable._execution_engine.run_backward(
    146         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    147         allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

ptrblck · June 4, 2021, 4:54pm

You are most likely hitting this issue, which is already fixed in the nightly CUDA11 pip wheels (and was not an issue in other binary configs), so you could either update to the nightly pip wheel, or use the CUDA10.2 or conda binaries.

Siladittya_Manna · July 1, 2021, 8:28pm

I faced the same issue when trying to implement the SimCLR framework.

The files are available here
(ssl_models/src at main · sadimanna/ssl_models · GitHub)

To reproduce the error I would suggest running ‘python main.py --batch_size 8 --download True’. This will download the CIFAR-10 dataset (163MB).

But when I run the same code on Colab (Google Colaboratory) I don’t get the error.

I uploaded the files on Colab and ran the file main.py and got the error again. The notebook with the error is below
(Google Colaboratory)

I can't tell whether this is a CUDA issue or something else.

ptrblck · July 2, 2021, 2:31am

Have you checked the linked issue and made sure to either update to 1.9.0 or use the conda binaries?

Siladittya_Manna · July 2, 2021, 3:06am

I had the issue on my laptop too. I had 1.9.0 installed on it, with CUDA 11.1. The Python version is 3.9.

It is confusing how the same code produces different results when run in a different way. The code in the first Colab notebook link from my last reply is simply split into separate files, such as one file each for the optimizer, loss, and data modules (this is the code in the GitHub link), and when I run that I get the error. This is strange because both are the same code.

Also, on Google Colab the torch version is 1.9.0 with CUDA 10.2, and the error still occurred.

ptrblck · July 2, 2021, 6:16am

Thanks for the update. Could you post a minimal, executable code snippet to reproduce the issue as well as the output of python -m torch.utils.collect_env?

Siladittya_Manna · July 2, 2021, 6:59am

The code below should reproduce the error

!git clone https://github.com/sadimanna/ssl_models.git
!pip install pytorch-lightning pytorch-lightning-bolts
!python ssl_models/src/main.py --batch_size 16 --gpus 1 --download True

Please reply if it does not. Otherwise, the outputs can be seen in this Colab notebook (Google Colaboratory)

Output from python -m torch.utils.collect_env ?

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7 (64-bit runtime)
Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu102
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu102
[conda] Could not collect

ptrblck · July 4, 2021, 9:20pm

Thanks for the repro instructions.
This error is raised when executing it:

Epoch 1
  0%|          | 0/2500 [00:01<?, ?batch/s]
Traceback (most recent call last):
  File "ssl_models/src/main.py", line 96, in <module>
    main(args)
  File "ssl_models/src/main.py", line 53, in main
    trainer.fit()
  File "/workspace/src/ssl_models/src/trainer.py", line 51, in fit
    train_epoch_loss = self.train_epoch(self.model, self.train_loader, self.optimizer)
  File "/workspace/src/ssl_models/src/trainer.py", line 83, in train_epoch
    train_loss = model.training_step(batch, step)
  File "/workspace/src/ssl_models/src/simclr.py", line 115, in training_step
    loss = self.criterion(x1,x2)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/src/ssl_models/src/losses.py", line 32, in forward
    sim = self.similarity_f(z.unsqueeze(1), z.unsqueeze(0)) / self.temperature
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/distance.py", line 75, in forward
    return F.cosine_similarity(x1, x2, self.dim, self.eps)
RuntimeError: cosine_similarity requires both inputs to have the same sizes, but x1 has [32, 1, 128] and x2 has [1, 32, 128]

Siladittya_Manna · July 5, 2021, 3:02am

Thanks for the reply.
I tried downgrading both PyTorch and CUDA on my laptop just a few hours ago.

I found out that the error was in the mismatch of dimensions between the input and the weights in the Linear layer I was using.

Cosine Similarity, however, is working fine on my laptop. And even in colab. I don’t understand why you are getting this error.

I then upgraded from torch 1.8.1 + cuda 10.2 to torch 1.9.0 + cuda 10.2 and then to cuda 11.1
It is actually working without any errors for now.

kkuchynskyi · July 18, 2021, 6:58pm

I also had the same problem. In my case the error was due to an incorrect input tensor being passed to the nn.Linear layer. The layer was initialized as nn.Linear(512 * 7 * 7, 4096), but the actual input tensor shape before resizing was [512, 6, 6].

My configuration: ubuntu, python 3.9, cuda 11.2, torch 1.9, rtx 3060
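
A shape mismatch like this can be caught before the tensor reaches the layer by comparing the flattened feature size with the layer's `in_features`. The sketch below is pure Python and only illustrates the arithmetic; `flattened_size` and `check_linear_input` are hypothetical helper names, not PyTorch functions.

```python
import math


def flattened_size(feature_shape):
    """Number of values per sample once a feature map of this shape is flattened."""
    return math.prod(feature_shape)


def check_linear_input(feature_shape, in_features):
    """Raise a readable error if the flattened size does not match in_features."""
    actual = flattened_size(feature_shape)
    if actual != in_features:
        raise ValueError(
            f"feature map {tuple(feature_shape)} flattens to {actual} values, "
            f"but the layer expects in_features={in_features}"
        )
    return actual


# Numbers from the post above: layer built for 512 * 7 * 7, actual map is [512, 6, 6].
try:
    check_linear_input((512, 6, 6), 512 * 7 * 7)
except ValueError as err:
    print(err)
```

Here 512 * 6 * 6 = 18432 while the layer expects 25088. Running the same forward pass on the CPU is another option, since it often reports a shape mismatch in plain terms rather than as a cuBLAS failure.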

my3bikaht · October 20, 2021, 6:53am

I had the same issue this morning too. In my case everything worked fine until I added a new GPU via a riser cable. After I forced the PCIe slot to switch to Gen3 mode, there have been no issues for the last 30 minutes of training.
So the problem can be hardware, not just software.

Jo-w · March 21, 2022, 2:19pm

Hi, I got the same issue and have solved it by adding a flatten layer before the linear layer. Feel free to try it out.
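
A minimal sketch of that idea, assuming the input shape from the earlier reproduction in this thread (`[2, 32, 8, 4, 8]`): `nn.Flatten()` keeps the batch dimension and collapses the rest, so the following `nn.Linear` needs `in_features` equal to the product of the remaining dimensions. The helper `flat_features` and the output width of 32 are illustrative choices, and the PyTorch part is guarded so the arithmetic can be checked without it.

```python
import math


def flat_features(shape):
    """Features per sample after flattening everything except the batch dimension."""
    return math.prod(shape[1:])


input_shape = (2, 32, 8, 4, 8)  # batch of 2, as in the reproduction posted earlier
print(flat_features(input_shape))  # 8192

try:
    import torch
    from torch import nn
except ImportError:
    torch = None

if torch is not None:
    model = nn.Sequential(
        nn.Flatten(),                                 # [2, 32, 8, 4, 8] -> [2, 8192]
        nn.Linear(flat_features(input_shape), 32),    # in_features matches the flattened size
    )
    out = model(torch.rand(*input_shape))
    print(out.shape)  # torch.Size([2, 32])
```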