Matrix inversion fails on GPU (Google Colab)

I’m having trouble inverting a matrix on the GPU, even though the same matrix inverts fine on the CPU. I am using Google Colab with torch version 1.3.0+cu100. Here is my code:

import torch
dim = 100
# CPU inversion
A = torch.rand(dim,dim,device='cpu')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

# GPU inversion
A = A.to('cuda')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

For a small matrix (i.e. setting dim = 100), I get the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

For a large matrix (i.e. setting dim = 1000), I get the following error:
RuntimeError: inverse_cuda: U(1,1) is zero, singular U.

In both cases, the inversion goes fine on the CPU, but inverting the same matrix on the GPU fails. Any help is appreciated!

Edit: Running the above code on another workstation with torch version 1.0.1.post2 does not produce this error.

Does the error happen during the inverse or during the matmul in the print?

When dim=100, it fails on matmul and we get the cublas error. When dim=1000, it fails on the inversion step, and we get the singular U error.
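Since CUDA kernels are launched asynchronously, an error can surface at a later call than the one that caused it. Here is a minimal sketch for pinning down the failing step, assuming it is fine to force a synchronization after each operation. The helper name `run_step` is hypothetical, and the script falls back to the CPU when no GPU is present:

```python
import torch

def run_step(name, fn):
    """Run fn(), synchronize the GPU if present, and report whether the step failed."""
    try:
        out = fn()
        if torch.cuda.is_available():
            # Wait for queued kernels so a pending CUDA error is raised here,
            # not at some later, unrelated call.
            torch.cuda.synchronize()
        print(f"{name}: ok")
        return out
    except RuntimeError as err:
        print(f"{name}: failed with {err}")
        return None

device = "cuda" if torch.cuda.is_available() else "cpu"
dim = 100
A = torch.rand(dim, dim, device=device)

Ainv = run_step("inverse", lambda: A.inverse())
# Only attempt the product if the inversion itself succeeded.
prod = run_step("matmul", lambda: torch.matmul(A, Ainv)) if Ainv is not None else None
```

Setting CUDA_LAUNCH_BLOCKING=1 in the environment before launching Python should have a similar effect without code changes.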

Reverting to a previous version of PyTorch fixes the errors. In Colab, we can do that with:

!pip install torch==1.0.0 torchvision==0.2.1
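After reinstalling, it is worth confirming which build the notebook actually loaded; in Colab the runtime may need a restart before the new version is picked up. A quick check, using only standard torch attributes:

```python
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)  # None for CPU-only builds
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first visible GPU
    print("device:", torch.cuda.get_device_name(0))
```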

Hi,

It might be related to some MAGMA updates.
You can follow the progress in this issue: https://github.com/pytorch/hub/issues/62

I have the same issue with PyTorch.

Downgrading to torch==1.0.0 and torchvision==0.2.1 did not work for me; the same error persists.
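If downgrading does not help, one possible stopgap (not something suggested in this thread, so treat it as an assumption) is to invert on the CPU and move the result back to the original device. This only sidesteps the inverse_cuda failure, not the cuBLAS matmul failure, and the host round trip costs time for large matrices. The helper name `inverse_via_cpu` is hypothetical:

```python
import torch

def inverse_via_cpu(A):
    """Invert A on the CPU and return the result on A's original device."""
    return A.cpu().inverse().to(A.device)

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
A = torch.rand(100, 100, device=device)
Ainv = inverse_via_cpu(A)

# Largest deviation of A @ Ainv from the identity; small for a well-conditioned A.
err = (torch.matmul(A, Ainv) - torch.eye(100, device=device)).abs().max().item()
print("max deviation from identity:", err)
```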

Hi @albanD, this is really strange: the code that the OP posted runs for me with no error.

$ python test_cublas.py 
tensor([[ 1.0000e+00,  4.2183e-06,  6.3342e-07,  ..., -2.5928e-06,
         -8.4937e-07, -4.7684e-06],
        [-1.1551e-06,  1.0000e+00,  3.6079e-07,  ..., -1.7285e-06,
         -1.6391e-07, -5.0068e-06],
        [-5.3546e-07,  3.8960e-06,  1.0000e+00,  ..., -1.8477e-06,
         -6.8545e-07, -4.8876e-06],
        ...,
        [-9.6698e-07,  2.2674e-06, -2.1878e-07,  ...,  1.0000e+00,
         -2.9802e-08, -2.7418e-06],
        [ 9.3132e-07,  3.5167e-06,  1.7881e-07,  ..., -2.9802e-06,
          1.0000e+00, -4.6492e-06],
        [-2.9802e-08,  5.2452e-06,  7.1526e-07,  ..., -2.7716e-06,
         -8.1770e-07,  9.9999e-01]])
tensor([[ 1.0000e+00, -3.4571e-06,  6.5565e-07,  ..., -2.5034e-06,
          5.9605e-07,  4.7684e-07],
        [ 1.0490e-05,  1.0000e+00,  9.5367e-07,  ..., -6.6757e-06,
         -4.7684e-06, -9.5367e-06],
        [ 0.0000e+00, -2.9802e-06,  1.0000e+00,  ..., -2.3842e-07,
         -2.7418e-06, -1.0490e-05],
        ...,
        [-2.3842e-07,  2.5034e-06, -2.9802e-07,  ...,  1.0000e+00,
          2.5332e-06,  7.8678e-06],
        [ 3.8147e-06, -1.4305e-06,  3.5763e-07,  ..., -1.9073e-06,
          1.0000e+00,  1.9073e-06],
        [ 7.1526e-06, -1.6689e-06, -8.3447e-07,  ...,  0.0000e+00,
         -3.0994e-06,  1.0000e+00]], device='cuda:0')

However, when I run a repo’s code, I get the same CUBLAS_STATUS_EXECUTION_FAILED error, with or without CUDA_LAUNCH_BLOCKING=1:

$ CUDA_LAUNCH_BLOCKING=1 python demo.py --filename input/easy_bat.jpg --class_name bat
2021-03-26 18:06:07,542 INFO     Calling with args: Namespace(class_name='bat', filename='input/easy_bat.jpg', lw_collision=None, lw_depth=None, lw_inter=None, lw_inter_part=None, lw_scale=None, lw_scale_person=None, lw_sil=None, mesh_index=0, output_dir='output')
2021-03-26 18:06:10,955 INFO     Loading checkpoint from detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 18:06:10,962 INFO     URL https://dl.fbaipublicfiles.com/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl cached in /home/grad3/jalal/.torch/fvcore_cache/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 18:06:11,069 INFO     Reading a file from 'Detectron2 Model Zoo'
WARNING: You are using a SMPL model, with only 10 shape coefficients.
class_name:  bat
  0%|                                                                                                    | 0/800.0 [00:00<?, ?it/s]Traceback (most recent call last):
  File "demo.py", line 145, in <module>
    main(get_args())
  File "demo.py", line 121, in main
    instances=instances, class_name=args.class_name, mesh_index=args.mesh_index
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 406, in find_optimal_poses
    num_initializations=num_initializations,
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 287, in find_optimal_pose
    vertices=torch.matmul(vertices.unsqueeze(0), rotations_init),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0%|                                                                                                    | 0/800.0 [00:00<?, ?it/s]
Segmentation fault