I recently updated Python/PyTorch/cudatoolkit to 3.10.8, 1.13.1, and 11.7, respectively, on a computer cluster. Since then I have been seeing the following error at a specific layer of an ML model I am running:
CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
The error is reproducible when I simplify the model down to just two layers (an nn.Conv1d followed by an nn.Linear). I am wondering why this is happening now, when it wasn't an issue before.
Code:
import torch
import torch.nn as nn

device = 'cuda:0'
rr = torch.zeros([2, 20, 5000]).to(device)
layer1 = nn.Conv1d(20, 500, kernel_size=4, stride=4, groups=20, bias=False).to(device)
layer2 = nn.Linear(500, 768).to(device)
l1out = layer1(rr)                     # shape [2, 500, 1250]
l2out = layer2(l1out.transpose(1, 2))  # linear layer receives [2, 1250, 500]
Dimension breakdown:
Input: torch.float32, shape [2,20,5000]
nn.Conv1d: in_channels 20, out_channels 500, kernel and stride both 4, 20 groups, no bias
nn.Linear: in_features 500, out_features 768
The convolution output is transposed before being passed to the linear layer (so the input to the linear layer has shape [2,1250,500])
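For reference, the 1250 in the transposed shape comes from the standard Conv1d output-length formula, which I can sanity-check without touching the GPU (helper name is mine, just for illustration):

```python
def conv1d_out_len(l_in, kernel_size, stride, padding=0, dilation=1):
    """Output length of nn.Conv1d: floor((L_in + 2p - d*(k-1) - 1) / s) + 1."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv1d_out_len(5000, kernel_size=4, stride=4))  # 1250
```

So the shapes themselves are consistent; the failure is not a plain dimension mismatch.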
The error occurs in the linear layer:
File "/hpc/group/collinslab/xc130/.conda/amll/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling
cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)
Notes:
- The code runs fine on CPU, but not when I run the same operations on GPU. The error occurs on different cluster partitions with different GPUs (including an NVIDIA RTX A5000 and an NVIDIA GeForce RTX 2080 Ti), each with 8 GB allocated.
- There is no error when I pass a tensor of zeros directly into the linear layer (instead of the output of the convolutional layer).
- There is no error when I replace the first (convolutional) layer with a linear layer.
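One more data point I plan to test (not verified as the fix): transpose(1, 2) returns a non-contiguous view, and the error is raised from cuBLAS's strided-batched GEMM path, so it may be worth materializing the tensor with .contiguous() before the linear layer. A minimal sketch (falls back to CPU if no GPU is available):

```python
import torch
import torch.nn as nn

# Same toy model as above; use CPU when CUDA is absent so the sketch still runs.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

rr = torch.zeros([2, 20, 5000]).to(device)
layer1 = nn.Conv1d(20, 500, kernel_size=4, stride=4, groups=20, bias=False).to(device)
layer2 = nn.Linear(500, 768).to(device)

l1out = layer1(rr)  # [2, 500, 1250]
# .contiguous() copies the transposed view into plain row-major memory,
# so cuBLAS no longer sees the transposed strides (hypothetical workaround).
l2out = layer2(l1out.transpose(1, 2).contiguous())
print(l2out.shape)  # torch.Size([2, 1250, 768])
```

If this avoids the error, it would point at the strides of the transposed conv output rather than the values, which would also be consistent with the zeros-tensor experiment above.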