Linear layer throwing CUBLAS_STATUS_INVALID_VALUE error

I recently updated my Python/PyTorch/cudatoolkit to 3.10.8, 1.13.1, and 11.7, respectively, on a computer cluster.

I started seeing the following error (CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)) at a specific layer in a ML model I am running. This error was reproduced when I simplified it to 2 layers (nn.Conv1d and then nn.Linear). Wondering why this is happening when it wasn’t an issue before.

Code:

import torch
import torch.nn as nn

device = 'cuda:0'
rr = torch.zeros([2, 20, 5000]).to(device)
layer1 = nn.Conv1d(20, 500, kernel_size=4, stride=4, groups=20, bias=False).to(device)
layer2 = nn.Linear(500, 768).to(device)
l1out = layer1(rr)                     # shape [2, 500, 1250]
l2out = layer2(l1out.transpose(1, 2))  # linear layer receives [2, 1250, 500]

Dimension breakdown:

Input: torch.float32, shape [2,20,5000]

nn.Conv1d: in_channels 20, out_channels 500, kernel and stride both 4, 20 groups, no bias

nn.Linear: in_features 500, out_features 768

The conv output is transposed before being passed to the linear layer, so the linear layer receives a tensor of shape [2,1250,500]
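For reference, the 1250 in that shape comes straight from the standard Conv1d output-length formula; here is a quick plain-Python sanity check (the helper name is just illustrative):

```python
import math

def conv1d_out_len(l_in, kernel_size, stride, padding=0, dilation=1):
    # Standard output-length formula from the torch.nn.Conv1d docs:
    # L_out = floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
    return math.floor((l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)

print(conv1d_out_len(5000, kernel_size=4, stride=4))  # 1250
```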

The error occurs in the linear layer:

File “/hpc/group/collinslab/xc130/.conda/amll/lib/python3.10/site-packages/torch/nn/modules/linear.py”, line 114, in forward
return F.linear(input, self.weight, self.bias)

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

Notes:

  • The code runs fine on CPU, but fails when I run the same operations on GPU. It happens on different cluster partitions with different GPUs (including NVIDIA RTX A5000 and NVIDIA GeForce RTX 2080 Ti), each with 8GB of memory allocated.
  • No error when I feed a tensor of zeros directly into the linear layer (as opposed to taking the output of the convolutional layer).
  • No error when I replace the first (convolutional) layer with a linear layer.
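Since the failure only shows up when the linear layer consumes the transposed conv output, one workaround I plan to test (not verified as a fix yet) is forcing the transposed tensor to be contiguous before the linear layer, so cuBLAS sees a densely laid-out input:

```python
import torch
import torch.nn as nn

# Falls back to CPU so the snippet runs anywhere; the error itself is GPU-only.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

rr = torch.zeros([2, 20, 5000]).to(device)
layer1 = nn.Conv1d(20, 500, kernel_size=4, stride=4, groups=20, bias=False).to(device)
layer2 = nn.Linear(500, 768).to(device)

x = layer1(rr).transpose(1, 2)  # non-contiguous view, shape [2, 1250, 500]
out = layer2(x.contiguous())    # .contiguous() copies the data into a dense layout
print(out.shape)                # torch.Size([2, 1250, 768])
```

The extra copy costs some memory bandwidth, but if this makes the error disappear it would point at the strides of the transposed tensor rather than its values.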

I cannot reproduce the issue in 1.13.1+cu117 on a 3090, but will try a 2080 Ti next.
Thanks for sharing the minimal and executable code, btw!

I had initially set up the environment with Python 3.11.1 and downgraded to 3.10.8 - could that have caused an issue with the cudatoolkit? Otherwise, do you know what issues can result in this error?

I doubt the Python version would interact with the NVIDIA pip wheels, as these ship the CUDA libraries and do not depend on any Python features.
I tried to reproduce the issue on a 2080Ti and it still works for me in a new environment:

>>> import torch
>>> torch.__version__
'1.13.1+cu117'
>>> device = "cuda:1"
>>> print(torch.cuda.get_device_properties(device))
_CudaDeviceProperties(name='NVIDIA GeForce RTX 2080 Ti', major=7, minor=5, total_memory=12028MB, multi_processor_count=70)
>>> import torch.nn as nn
>>> rr = torch.zeros([2,20,5000]).to(device)
>>> layer1 = nn.Conv1d(20,500,kernel_size=4,stride=4,groups=20,bias=False).to(device)
>>> layer2 = nn.Linear(500,768).to(device)
>>> l1out = layer1(rr)
>>> l2out = layer2(l1out.transpose(1,2))
>>> l2out
tensor([[[-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         ...,
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031]],

        [[-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         ...,
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031],
         [-0.0075,  0.0157,  0.0326,  ...,  0.0028, -0.0014, -0.0031]]],
       device='cuda:1', grad_fn=<AddBackward0>)