RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Hello all,

I am using PyTorch 1.13.0+cu117. My environment reports NVIDIA-SMI 450.80.02, Driver Version: 450.80.02, CUDA Version: 11.0.
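
To make the two CUDA versions above easier to compare, here is a minimal sketch that parses them from the strings I quoted. The helper names `cuda_from_torch_version` and `parse_cuda_version` are made up for illustration and are not part of PyTorch. The sketch also assumes the `+cuXYZ` suffix encodes major and minor as "all digits but the last" and "last digit". A gap between the two versions is only something to look into; I am not claiming it is the cause of the error.

```python
import re


def cuda_from_torch_version(torch_version):
    """Extract the CUDA build tag from a string like '1.13.0+cu117' -> (11, 7)."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None  # CPU-only build or an unrecognized tag
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major.
    return int(digits[:-1]), int(digits[-1])


def parse_cuda_version(text):
    """Parse a 'major.minor' string such as the '11.0' shown by nvidia-smi."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


build = cuda_from_torch_version("1.13.0+cu117")  # from torch.__version__
driver = parse_cuda_version("11.0")              # from the nvidia-smi header

print("PyTorch wheel built for CUDA:", build)
print("Driver reports CUDA:", driver)
print("Wheel targets a newer CUDA than the driver reports:", build > driver)
```

With the values from my setup, the wheel targets CUDA 11.7 while the driver reports CUDA 11.0.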

In a Python terminal, I tried this very simple example:

>>> import torch
>>> x=torch.ones(2,2,1).to('cuda')
>>> y=torch.ones(2,1,2).to('cuda')
>>> x
tensor([[[1.],
         [1.]],

        [[1.],
         [1.]]], device='cuda:0')
>>> y
tensor([[[1., 1.]],

        [[1., 1.]]], device='cuda:0')
>>> y@x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
>>> torch.bmm(y,x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
>>> torch.matmul(y,x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
>>> x=torch.ones(2,1).to('cuda')
>>> y=torch.ones(1,2).to('cuda')
>>> y@x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
>>> torch.mm(y,x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
>>> torch.mm(x,y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
>>>

The errors are clearly not caused by a shape mismatch. Does anyone have any idea what is going on? Thanks!