Vertices = torch.matmul(vertices.unsqueeze(0), rotations_init): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched` on CentOS

Someone in the thread "Matrix inversion fails on GPU (google Colab)" had exactly the same problem as me, but the script from that thread gives me no trouble.
I ran it and there was no problem:

import torch
dim = 100
# CPU inversion
A = torch.rand(dim,dim,device='cpu')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

# GPU inversion
A = A.to('cuda')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

The result was:

tensor([[ 1.0000e+00,  6.5939e-06,  1.2953e-06,  ...,  5.2452e-06,
         -5.4836e-06,  1.6689e-06],
        [-2.6617e-06,  1.0000e+00,  1.5359e-06,  ...,  7.6294e-06,
          2.3842e-07,  2.0266e-06],
        [ 6.7785e-06,  6.1743e-06,  1.0000e+00,  ...,  7.1526e-06,
         -1.5497e-06, -4.7684e-07],
        ...,
        [-3.3288e-06,  1.0316e-06,  5.7282e-07,  ...,  1.0000e+00,
          1.1921e-06, -2.3842e-07],
        [ 7.4506e-07,  1.9073e-06,  7.1526e-07,  ...,  3.3379e-06,
          1.0000e+00,  2.8610e-06],
        [ 9.2387e-07, -1.0252e-05, -8.6427e-07,  ..., -4.5300e-06,
          7.0781e-06,  1.0000e+00]])
tensor([[ 1.0000e+00,  2.3842e-07,  5.9605e-08,  ...,  2.3842e-07,
          0.0000e+00,  1.1921e-07],
        [ 5.9605e-08,  1.0000e+00, -3.5763e-07,  ...,  2.3842e-07,
         -5.9605e-08, -2.3842e-07],
        [-2.8312e-07,  2.7418e-06,  1.0000e+00,  ...,  1.7881e-07,
          2.9802e-07,  1.7881e-07],
        ...,
        [ 2.6822e-07,  2.3842e-07,  2.3842e-07,  ...,  1.0000e+00,
          4.4703e-07,  6.5565e-07],
        [ 5.9605e-08,  2.3842e-07,  3.5763e-07,  ...,  0.0000e+00,
          1.0000e+00, -2.3842e-07],
        [ 1.1921e-07,  1.7881e-06, -1.1921e-07,  ..., -5.9605e-07,
          4.4703e-07,  1.0000e+00]], device='cuda:0')
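
Since the inversion test passes on both devices, a check that exercises the same kind of call as the failing line (a broadcast batched `torch.matmul`) may help narrow things down. Below is a minimal sketch, not the original code: the shapes of `vertices` and `rotations_init` are assumptions, because the post does not give them, and `check_batched_matmul` is a hypothetical helper name. It computes a CPU reference, repeats the product on the GPU if one is available, and compares the two.

```python
import torch


def check_batched_matmul(vertices, rotations, device):
    """Compare a broadcast batched matmul on `device` against a CPU reference."""
    # CPU reference: (1, N, 3) @ (B, 3, 3) broadcasts to (B, N, 3).
    reference = torch.matmul(vertices.unsqueeze(0), rotations)

    # Same product on the target device.
    result = torch.matmul(vertices.unsqueeze(0).to(device), rotations.to(device))
    if device.type == "cuda":
        # CUDA kernels run asynchronously; synchronize so that any error
        # surfaces here rather than at a later, unrelated call.
        torch.cuda.synchronize()

    return torch.allclose(reference, result.cpu(), atol=1e-3)


# Fall back to the CPU when no GPU is visible, so the script still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical shapes: N = 500 vertices in 3D, B = 8 rotation matrices.
vertices = torch.rand(500, 3)
rotations_init = torch.rand(8, 3, 3)

ok = check_batched_matmul(vertices, rotations_init, device)
print("device:", device, "| matches CPU reference:", ok)
```

If this small case succeeds on the GPU while the real call fails, the difference is likely in the actual inputs (shape, dtype, or device), so printing `vertices.shape`, `rotations_init.shape`, and both dtypes just before the failing line would be a reasonable next step.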