Matrix inversion fails on GPU (Google Colab)

I’m having trouble inverting a matrix on the GPU, even though the same matrix inverts fine on the CPU. I am using Google Colab with torch version 1.3.0+cu100. Here is my code:

import torch
dim = 100
# CPU inversion
A = torch.rand(dim,dim,device='cpu')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

# GPU inversion
A = A.to('cuda')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

For a small matrix (i.e. setting dim = 100), I get the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

For a large matrix (i.e. setting dim = 1000), I get the following error:
RuntimeError: inverse_cuda: U(1,1) is zero, singular U.

In both cases, the inversion goes fine on the CPU, but inverting the same matrix on the GPU fails. Any help is appreciated!

Edit: Running the above code on another workstation with torch version 1.0.1.post2 does not produce this error.

Does the error happen during the inverse or during the matmul in the print?

When dim=100, it fails on matmul and we get the cublas error. When dim=1000, it fails on the inversion step, and we get the singular U error.
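One way to confirm which step fails is to synchronize after each operation, since CUDA work is queued asynchronously and an error can otherwise be reported at a later call. The following is a minimal sketch, not code from the original post: `run_step` is a hypothetical helper, and the script falls back to the CPU when no GPU is available.

```python
import torch


def run_step(name, fn, device):
    """Run fn() and synchronize so that asynchronous CUDA errors surface here."""
    try:
        out = fn()
        if device.type == "cuda":
            # Block until queued kernels finish; pending errors are raised now.
            torch.cuda.synchronize()
        return out, None
    except RuntimeError as exc:
        return None, f"{name} failed: {exc}"


# Assumption: use the GPU when present, otherwise run the same steps on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dim = 100
A = torch.rand(dim, dim, device=device)

prod = None
Ainv, err = run_step("inverse", lambda: A.inverse(), device)
if err is None:
    prod, err = run_step("matmul", lambda: torch.matmul(A, Ainv), device)

if err is None:
    # Largest absolute deviation of A @ Ainv from the identity matrix.
    residual = (prod - torch.eye(dim, device=device)).abs().max().item()
    print(f"max |A @ Ainv - I| = {residual:.3e}")
else:
    print(err)
```

Changing `dim` to 1000 repeats the check for the large-matrix case described in the question.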

Reverting to an earlier version of PyTorch fixes both errors. In Colab, this can be done with:

!pip install torch==1.0.0 torchvision==0.2.1
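Since the behaviour differs between installations, it can help to print exactly which build is active after changing versions. This is a small sketch under the assumption that the runtime has picked up the reinstalled package; `describe_torch_env` is a hypothetical helper name, not a PyTorch function.

```python
import torch


def describe_torch_env():
    """Collect the version details that matter when comparing two machines."""
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch_env().items():
    print(f"{key}: {value}")
```

Posting this output alongside an error report makes it easier to compare a failing setup with a working one.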

Hi,

It might be related to some magma updates.
You can see the progress on this issue: https://github.com/pytorch/hub/issues/62

I have the same issue with PyTorch.

Downgrading to torch==1.0.0 and torchvision==0.2.1 did not work for me; the same error persists.

Hi @albanD, this is really strange: the code that the OP posted runs for me with no error.

$ python test_cublas.py 
tensor([[ 1.0000e+00,  4.2183e-06,  6.3342e-07,  ..., -2.5928e-06,
         -8.4937e-07, -4.7684e-06],
        [-1.1551e-06,  1.0000e+00,  3.6079e-07,  ..., -1.7285e-06,
         -1.6391e-07, -5.0068e-06],
        [-5.3546e-07,  3.8960e-06,  1.0000e+00,  ..., -1.8477e-06,
         -6.8545e-07, -4.8876e-06],
        ...,
        [-9.6698e-07,  2.2674e-06, -2.1878e-07,  ...,  1.0000e+00,
         -2.9802e-08, -2.7418e-06],
        [ 9.3132e-07,  3.5167e-06,  1.7881e-07,  ..., -2.9802e-06,
          1.0000e+00, -4.6492e-06],
        [-2.9802e-08,  5.2452e-06,  7.1526e-07,  ..., -2.7716e-06,
         -8.1770e-07,  9.9999e-01]])
tensor([[ 1.0000e+00, -3.4571e-06,  6.5565e-07,  ..., -2.5034e-06,
          5.9605e-07,  4.7684e-07],
        [ 1.0490e-05,  1.0000e+00,  9.5367e-07,  ..., -6.6757e-06,
         -4.7684e-06, -9.5367e-06],
        [ 0.0000e+00, -2.9802e-06,  1.0000e+00,  ..., -2.3842e-07,
         -2.7418e-06, -1.0490e-05],
        ...,
        [-2.3842e-07,  2.5034e-06, -2.9802e-07,  ...,  1.0000e+00,
          2.5332e-06,  7.8678e-06],
        [ 3.8147e-06, -1.4305e-06,  3.5763e-07,  ..., -1.9073e-06,
          1.0000e+00,  1.9073e-06],
        [ 7.1526e-06, -1.6689e-06, -8.3447e-07,  ...,  0.0000e+00,
         -3.0994e-06,  1.0000e+00]], device='cuda:0')

However, when I run the code from another repo, I get the same CUBLAS_STATUS_EXECUTION_FAILED error (with or without CUDA_LAUNCH_BLOCKING=1):

$ CUDA_LAUNCH_BLOCKING=1 python demo.py --filename input/easy_bat.jpg --class_name bat
2021-03-26 18:06:07,542 INFO     Calling with args: Namespace(class_name='bat', filename='input/easy_bat.jpg', lw_collision=None, lw_depth=None, lw_inter=None, lw_inter_part=None, lw_scale=None, lw_scale_person=None, lw_sil=None, mesh_index=0, output_dir='output')
2021-03-26 18:06:10,955 INFO     Loading checkpoint from detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 18:06:10,962 INFO     URL https://dl.fbaipublicfiles.com/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl cached in /home/grad3/jalal/.torch/fvcore_cache/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 18:06:11,069 INFO     Reading a file from 'Detectron2 Model Zoo'
WARNING: You are using a SMPL model, with only 10 shape coefficients.
class_name:  bat
  0%|                                                                                                    | 0/800.0 [00:00<?, ?it/s]Traceback (most recent call last):
  File "demo.py", line 145, in <module>
    main(get_args())
  File "demo.py", line 121, in main
    instances=instances, class_name=args.class_name, mesh_index=args.mesh_index
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 406, in find_optimal_poses
    num_initializations=num_initializations,
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 287, in find_optimal_pose
    vertices=torch.matmul(vertices.unsqueeze(0), rotations_init),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0%|                                                                                                    | 0/800.0 [00:00<?, ?it/s]
Segmentation fault