RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

pt.megamozg80 · December 16, 2022, 1:31pm

It’s very strange, but when I uninstall nvidia-cublas-cu11, the code starts working.

ptrblck · December 16, 2022, 6:09pm

Could you post the log from the pip uninstall command to show which version you’ve exactly removed?
Since your code is now working I would guess your setup had multiple cublas libs installed.
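
One way to look for duplicate cuBLAS installs from Python is to list the pip distributions whose names look CUDA-related. This is a minimal sketch: the helper names `cuda_related` and `installed_distributions` and the name fragments in `CUDA_HINTS` are illustrative assumptions, not part of PyTorch.

```python
from importlib import metadata

# Name fragments that usually indicate a CUDA-related pip package.
# This list is an assumption for illustration; extend it as needed.
CUDA_HINTS = ("nvidia-", "torch", "cudatoolkit")


def cuda_related(names):
    """Return the sorted names that contain any of the CUDA_HINTS."""
    return sorted(n for n in names if any(h in n.lower() for h in CUDA_HINTS))


def installed_distributions():
    """Map each installed pip distribution name to its version."""
    found = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
        except Exception:
            continue  # skip installs whose metadata cannot be read
        if name:
            found[name] = dist.version
    return found


if __name__ == "__main__":
    installed = installed_distributions()
    for name in cuda_related(installed):
        print(f"{name}=={installed[name]}")
```

If a package such as nvidia-cublas-cu11 appears alongside a system-wide CUDA toolkit, that may be worth a closer look. The listing only covers pip-visible packages in the active environment.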

pt.megamozg80 · December 16, 2022, 6:54pm

I made some changes: I simply reinstalled torch without specifying a version and then ran python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.11 (default, Sep  1 2021, 12:33:46)  [GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] (64-bit runtime)
Python platform: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.120
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.2.4
/usr/lib64/libcudnn_adv_infer.so.8.2.4
/usr/lib64/libcudnn_adv_train.so.8.2.4
/usr/lib64/libcudnn_cnn_infer.so.8.2.4
/usr/lib64/libcudnn_cnn_train.so.8.2.4
/usr/lib64/libcudnn_ops_infer.so.8.2.4
/usr/lib64/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchcam==0.3.2
[pip3] torchvision==0.14.1
[conda] Could not collect

My code with affine grid still wasn’t working

Then, after uninstalling nvidia-cublas, the output was: Successfully uninstalled nvidia-cublas-cu11-11.10.3.66

ptrblck · December 17, 2022, 6:55am

This seems wrong:

PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
...
CUDA runtime version: 11.4.120

since it points to a mismatch between the CUDA version used to build the wheels, which the pip install torch command would install (CUDA 11.7 with all needed dependencies), and what is detected as your CUDA runtime (11.4).
Did you install other Python packages depending on CUDA11.4?
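
The comparison above can be automated against the text printed by python -m torch.utils.collect_env. This is a minimal sketch: the function names `cuda_versions` and `is_mismatch` are hypothetical, and it only compares the major.minor parts of the two lines quoted above. A difference is a hint worth investigating, not proof of a broken setup.

```python
import re


def cuda_versions(report):
    """Extract (build, runtime) CUDA versions from collect_env text.

    Each value is a "major.minor" string, or None if the line is missing
    or holds no version number (for example "N/A").
    """
    def grab(label):
        pattern = rf"^{re.escape(label)}:\s*(\d+)\.(\d+)"
        match = re.search(pattern, report, re.MULTILINE)
        return f"{match.group(1)}.{match.group(2)}" if match else None

    return grab("CUDA used to build PyTorch"), grab("CUDA runtime version")


def is_mismatch(report):
    """True when both versions are present and differ in major.minor."""
    build, runtime = cuda_versions(report)
    return build is not None and runtime is not None and build != runtime


# Lines copied from the report earlier in this thread.
sample = """\
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
CUDA runtime version: 11.4.120
"""
print(cuda_versions(sample))  # ('11.7', '11.4')
print(is_mismatch(sample))    # True
```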

pt.megamozg80 · December 18, 2022, 9:07am

I don’t know for sure which of my packages require CUDA11.4. I can only assume that torchvision==0.14.0 depends on CUDA11.4.

ptrblck · December 18, 2022, 9:54pm

Why would you assume torchvision depends on CUDA 11.4? Did you see anything pointing towards this dependency in your install logs or during the runtime, or is this pure guessing?

pt.megamozg80 · December 19, 2022, 10:26am

It’s a pure guess, and it seems silly now.
I didn’t install any other packages depending on CUDA11.4
Also, after uninstalling, the output of the environment collection didn’t change:

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.8.11 (default, Sep  1 2021, 12:33:46)  [GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] (64-bit runtime)
Python platform: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.120
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.2.4
/usr/lib64/libcudnn_adv_infer.so.8.2.4
/usr/lib64/libcudnn_adv_train.so.8.2.4
/usr/lib64/libcudnn_cnn_infer.so.8.2.4
/usr/lib64/libcudnn_cnn_train.so.8.2.4
/usr/lib64/libcudnn_ops_infer.so.8.2.4
/usr/lib64/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchcam==0.3.2
[pip3] torchvision==0.14.1
[conda] Could not collect

Brian_Horakh · December 21, 2022, 10:38am

Same error:

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.0.76
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.0
[pip3] torch==1.13.1
[conda] cudatoolkit               11.6.0              hecad31d_10    conda-forge
[conda] numpy                     1.21.0                   pypi_0    pypi
[conda] torch                     1.12.0                   pypi_0    pypi
[conda] torchaudio                0.12.0+cu116             pypi_0    pypi
[conda] torchtext                 0.13.0                   pypi_0    pypi
[conda] torchvision               0.13.0+cu116             pypi_0    pypi

FRK · December 27, 2022, 5:00pm

I had the same issue as in the logs above and figured out, with the pointers above, that the cause was a mismatch between the CUDA runtime and the PyTorch build.

I think this happened because I updated my Ubuntu packages (everything was running smoothly a day earlier).

To resolve it:

I purged the NVIDIA drivers and reinstalled CUDA 11.7 and cuDNN.
I did not reinstall PyTorch.

rohitg · January 2, 2023, 8:14pm

I faced the same issue. There was no mismatch in tensor shapes; I had to change the PyTorch CUDA version (11.7) to be compatible with my system’s CUDA 11.6 (I ended up downloading PyTorch 1.13 built for CUDA 11.6).

cthorrez · February 8, 2023, 7:19pm

I had the same error, and the root cause for me was also a mismatch between the PyTorch CUDA version and the system CUDA version. I used torch 1.13.1 with CUDA 11.6 and then it worked (my Docker image has CUDA 11.6).
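
The two replies above both resolved the error by picking a PyTorch wheel built for the system’s CUDA version. As a rough sketch, the matching pip command can be derived from the version numbers. The helper names `wheel_tag` and `pip_command` are hypothetical, and the command follows the `+cuXYZ` / `download.pytorch.org/whl/cuXYZ` pattern used for the 1.13 wheels; check the official previous-versions page to confirm that a given combination was actually published.

```python
def wheel_tag(cuda_version):
    """Turn a CUDA version such as "11.6" or "11.6.2" into a tag like "cu116"."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def pip_command(torch_version, cuda_version):
    """Build a pip command for a CUDA-specific PyTorch wheel (illustrative)."""
    tag = wheel_tag(cuda_version)
    return (
        f"pip install torch=={torch_version}+{tag} "
        f"--extra-index-url https://download.pytorch.org/whl/{tag}"
    )


# torch 1.13.1 with CUDA 11.6, the combination reported to work above.
print(pip_command("1.13.1", "11.6"))
```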

Jiabin_Li · May 30, 2023, 1:39am

I solved this error, which came from torch.matmul. Using python -m torch.utils.collect_env, I found that cuDNN was installed in a different path from CUDA. Torch versions after 1.2 are very strict about the CUDA and cuDNN versions. After I corrected this, the problem was solved.

jasjaf · October 4, 2023, 8:39pm

I was able to resolve this by adding the conda environment’s library directory to my LD_LIBRARY_PATH variable:

export LD_LIBRARY_PATH=/home/$USER/.conda/envs/$ENVNAME/lib:/usr/local/cuda-11/lib64
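
To see which cuBLAS copies the directories in LD_LIBRARY_PATH would offer, and in what order, a small scan can help. This is a minimal sketch: the function name `find_libs` and the default pattern are illustrative assumptions, and it inspects only LD_LIBRARY_PATH, not the other locations the dynamic loader may consult.

```python
import glob
import os


def find_libs(search_path, pattern="libcublas.so*"):
    """Return (directory, matching files) pairs in search order.

    `search_path` is a separator-delimited string like LD_LIBRARY_PATH.
    Directories without a match are left out.
    """
    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries
        matches = sorted(glob.glob(os.path.join(directory, pattern)))
        if matches:
            hits.append((directory, matches))
    return hits


if __name__ == "__main__":
    for directory, libs in find_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(directory)
        for lib in libs:
            print("   ", lib)
```

More than one directory in the output suggests that several cuBLAS builds are reachable, which matches the "multiple cublas libs" guess earlier in the thread.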

Mona_Jalal · February 7, 2024, 6:49pm

I get this error. Please let me know if you have any suggestions.

(gdrnpp) mona@ada:~/gdrnpp_bop2022$ ./det/yolox/tools/test_yolox.sh ./configs/yolox/bop_pbr/yolox_x_640_augCozyAAEhsv_ranger_30_epochs_mona_bop_test.py 0 ./output/yolox/bop_pbr/yolox_x_640_augCozyAAEhsv_ranger_30_epochs_mona_bop_test/model_final.pth



_module.pnp_net.features.0.weight
  _module.pnp_net.features.1.{bias, weight}
  _module.pnp_net.features.3.weight
  _module.pnp_net.features.4.{bias, weight}
  _module.pnp_net.features.6.weight
  _module.pnp_net.features.7.{bias, weight}
  _module.pnp_net.fc1.{bias, weight}
  _module.pnp_net.fc2.{bias, weight}
  _module.pnp_net.fc_r.{bias, weight}
  _module.pnp_net.fc_t.{bias, weight}
[0207_134552 detectron2@57]: 	Fusing conv bn...
ERROR [0207_134553 d2.engine.launch@82]: An error has been caught in function 'launch', process 'MainProcess' (839409), thread 'MainThread' (140334550648640):
Traceback (most recent call last):

  File "/home/mona/gdrnpp_bop2022/./det/yolox/tools/main_yolox.py", line 70, in <module>
    launch(
    -> <function launch at 0x7fa2132cac10>

> File "/home/mona/anaconda3/envs/gdrnpp/lib/python3.9/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
    |          -> (Namespace(config_file='./configs/yolox/bop_pbr/yolox_x_640_augCozyAAEhsv_ranger_30_epochs_mona_bop_test.py', resume=False,...
    -> <function main at 0x7fa0a6bc98b0>

  File "/home/mona/gdrnpp_bop2022/./det/yolox/tools/main_yolox.py", line 58, in main
    model = fuse_model(model)
            |          -> YOLOX(
            |               (backbone): YOLOPAFPN(
            |                 (backbone): CSPDarknet(
            |                   (stem): Focus(
            |                     (conv): BaseConv(
            |                       (conv): ...
            -> <function fuse_model at 0x7fa113d48310>

  File "/home/mona/gdrnpp_bop2022/det/yolox/tools/../../../det/yolox/utils/model_utils.py", line 67, in fuse_model
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
    |        |                |       -> BaseConv(
    |        |                |            (conv): Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    |        |                |            (bn): BatchNorm2d(80, eps...
    |        |                -> BaseConv(
    |        |                     (conv): Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    |        |                     (bn): BatchNorm2d(80, eps...
    |        -> <function fuse_conv_and_bn at 0x7fa113d48280>
    -> BaseConv(
         (conv): Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
         (bn): BatchNorm2d(80, eps...

  File "/home/mona/gdrnpp_bop2022/det/yolox/tools/../../../det/yolox/utils/model_utils.py", line 57, in fuse_conv_and_bn
    fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)
    |                    |     |  |     |      |                             -> tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
    |                    |     |  |     |      |                                        0., 0., 0., 0...
    |                    |     |  |     |      -> <method 'reshape' of 'torch._C._TensorBase' objects>
    |                    |     |  |     -> tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
    |                    |     |  |                0., 0., 0., 0...
    |                    |     |  -> tensor([[0.9995, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
    |                    |     |             [0.0000, 0.9995, 0.0000,  ..., 0.0000, 0.0000, 0.0000...
    |                    |     -> <built-in method mm of type object at 0x7fa1e1695ee0>
    |                    -> <module 'torch' from '/home/mona/anaconda3/envs/gdrnpp/lib/python3.9/site-packages/torch/__init__.py'>
    -> Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

and I have:

(gdrnpp) mona@ada:~/gdrnpp_bop2022$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.9.18 (main, Sep 11 2023, 13:41:44)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
GPU models and configuration: GPU 0: NVIDIA RTX 6000 Ada Generation
Nvidia driver version: 535.104.12
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] pytorch-lightning==1.6.0
[pip3] torch==1.10.1
[pip3] torchaudio==0.10.1
[pip3] torchmetrics==1.3.0.post0
[pip3] torchvision==0.11.2
[conda] blas                      1.0                         mkl    conda-forge
[conda] cudatoolkit               11.3.1              hb98b00a_12    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mypy-extensions           1.0.0                    pypi_0    pypi
[conda] numpy                     1.26.3           py39h474f0d3_0    conda-forge
[conda] pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.6.0                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     2.2.0                    pypi_0    pypi
[conda] torchaudio                0.10.1               py39_cu113    pytorch
[conda] torchmetrics              1.3.0.post0              pypi_0    pypi
[conda] torchvision               0.11.2               py39_cu113    pytorch

I am using this repo

ptrblck · February 7, 2024, 7:07pm

Could you post a minimal and executable code snippet reproducing the issue, please?

Mona_Jalal · February 19, 2024, 4:34pm

Sorry for the delayed response.

Here’s the code and error again. I am using this repo: jabarragann/gdrnpp_bop2022 at commit 21d103da8716755f6e3c73a9e127d7efd3852eed, along with the recommended setup in gdrnpp_bop2022/JuanInstallation.md at the same commit.


(juan-gdrnpp) mona@ada:~/juan/gdrnpp_bop2022$ ./core/gdrn_modeling/train_gdrn.sh  configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py  0

20240219_111807|core.utils.my_writer@198: eta: 0:00:32  epoch: 1 iter: 90/130[69.2%] time: 0.8363 lr: 7.2728e-05 max_mem: 25740M  total_loss: 249.4 (243.8)  loss_coor_x: 0.1796 (0.2404)  loss_coor_y: 0.1663 (0.4394)  loss_coor_z: 0.2217 (0.309)  loss_mask: 0.02429 (0.03835)  loss_mask_full: 0.03292 (0.06163)  loss_region: 248.3 (242.2)  loss_PM_R: 0.01011 (0.01141)  loss_centroid: 0.3883 (0.3848)  loss_z: 0.118 (0.1263) 
epoch_str:  epoch: 1 
20240219_111808|core.utils.my_writer@198: eta: 0:00:31  epoch: 1 iter: 91/130[70.0%] time: 0.8361 lr: 7.3527e-05 max_mem: 25740M  total_loss: 249.4 (243.7)  loss_coor_x: 0.1713 (0.2394)  loss_coor_y: 0.1659 (0.4363)  loss_coor_z: 0.2217 (0.3085)  loss_mask: 0.02429 (0.03822)  loss_mask_full: 0.03357 (0.06136)  loss_region: 248.3 (242.1)  loss_PM_R: 0.01011 (0.01141)  loss_centroid: 0.3883 (0.3849)  loss_z: 0.1173 (0.1261) 
epoch_str:  epoch: 1 
20240219_111809|core.utils.my_writer@198: eta: 0:00:30  epoch: 1 iter: 92/130[70.8%] time: 0.8357 lr: 7.4326e-05 max_mem: 25740M  total_loss: 249.4 (243.9)  loss_coor_x: 0.1643 (0.2384)  loss_coor_y: 0.1628 (0.4332)  loss_coor_z: 0.2217 (0.3074)  loss_mask: 0.02429 (0.03803)  loss_mask_full: 0.03421 (0.06108)  loss_region: 248.3 (242.3)  loss_PM_R: 0.01011 (0.01141)  loss_centroid: 0.3883 (0.3846)  loss_z: 0.1166 (0.1259) 
epoch_str:  epoch: 1 
20240219_111810|core.utils.my_writer@198: eta: 0:00:30  epoch: 1 iter: 93/130[71.5%] time: 0.8355 lr: 7.5126e-05 max_mem: 25740M  total_loss: 250.1 (244.1)  loss_coor_x: 0.1634 (0.2375)  loss_coor_y: 0.1598 (0.4302)  loss_coor_z: 0.2192 (0.3063)  loss_mask: 0.02379 (0.03787)  loss_mask_full: 0.03421 (0.06076)  loss_region: 249 (242.5)  loss_PM_R: 0.01011 (0.0114)  loss_centroid: 0.3886 (0.3848)  loss_z: 0.1166 (0.1256) 
epoch_str:  epoch: 1 
20240219_111810|core.utils.my_writer@198: eta: 0:00:29  epoch: 1 iter: 94/130[72.3%] time: 0.8351 lr: 7.5925e-05 max_mem: 25740M  total_loss: 250.1 (244)  loss_coor_x: 0.1626 (0.2366)  loss_coor_y: 0.1591 (0.4273)  loss_coor_z: 0.2182 (0.3051)  loss_mask: 0.02426 (0.03773)  loss_mask_full: 0.03421 (0.06046)  loss_region: 249 (242.4)  loss_PM_R: 0.01008 (0.01138)  loss_centroid: 0.3904 (0.3849)  loss_z: 0.1165 (0.1254) 
epoch_str:  epoch: 1 
20240219_111811|core.utils.my_writer@198: eta: 0:00:28  epoch: 1 iter: 95/130[73.1%] time: 0.8349 lr: 7.6724e-05 max_mem: 25740M  total_loss: 250.1 (243.9)  loss_coor_x: 0.1616 (0.2358)  loss_coor_y: 0.1575 (0.4244)  loss_coor_z: 0.2182 (0.3043)  loss_mask: 0.02426 (0.03763)  loss_mask_full: 0.03421 (0.06016)  loss_region: 249 (242.3)  loss_PM_R: 0.01019 (0.01138)  loss_centroid: 0.3904 (0.3847)  loss_z: 0.1154 (0.1251) 
epoch_str:  epoch: 1 
20240219_111812|core.utils.my_writer@198: eta: 0:00:27  epoch: 1 iter: 96/130[73.8%] time: 0.8348 lr: 7.7523e-05 max_mem: 25740M  total_loss: 249.4 (243.6)  loss_coor_x: 0.1596 (0.2347)  loss_coor_y: 0.1564 (0.4215)  loss_coor_z: 0.2182 (0.3028)  loss_mask: 0.02357 (0.03747)  loss_mask_full: 0.03471 (0.05991)  loss_region: 248.3 (242)  loss_PM_R: 0.01019 (0.01136)  loss_centroid: 0.3886 (0.3847)  loss_z: 0.1142 (0.1248) 
epoch_str:  epoch: 1 
20240219_111813|core.utils.my_writer@198: eta: 0:00:26  epoch: 1 iter: 97/130[74.6%] time: 0.8347 lr: 7.8322e-05 max_mem: 25740M  total_loss: 249.4 (243.7)  loss_coor_x: 0.1565 (0.2338)  loss_coor_y: 0.1562 (0.4188)  loss_coor_z: 0.216 (0.3019)  loss_mask: 0.02357 (0.03731)  loss_mask_full: 0.03471 (0.05962)  loss_region: 248.3 (242.2)  loss_PM_R: 0.01019 (0.01136)  loss_centroid: 0.3883 (0.3844)  loss_z: 0.1133 (0.1245) 
epoch_str:  epoch: 1 
20240219_111814|core.utils.my_writer@198: eta: 0:00:25  epoch: 1 iter: 98/130[75.4%] time: 0.8345 lr: 7.9122e-05 max_mem: 25740M  total_loss: 249.4 (243.6)  loss_coor_x: 0.1542 (0.2328)  loss_coor_y: 0.1556 (0.4159)  loss_coor_z: 0.216 (0.3011)  loss_mask: 0.02309 (0.03715)  loss_mask_full: 0.03443 (0.05937)  loss_region: 248.3 (242.1)  loss_PM_R: 0.01019 (0.01135)  loss_centroid: 0.3886 (0.3846)  loss_z: 0.1108 (0.1242) 
epoch_str:  epoch: 1 
20240219_111815|core.utils.my_writer@198: eta: 0:00:25  epoch: 1 iter: 99/130[76.2%] time: 0.8344 lr: 7.9921e-05 max_mem: 25740M  total_loss: 250.1 (243.8)  loss_coor_x: 0.153 (0.2318)  loss_coor_y: 0.1556 (0.4133)  loss_coor_z: 0.2154 (0.3)  loss_mask: 0.02309 (0.03698)  loss_mask_full: 0.03401 (0.05911)  loss_region: 249 (242.2)  loss_PM_R: 0.01023 (0.01134)  loss_centroid: 0.3886 (0.385)  loss_z: 0.1089 (0.1239) 
epoch_str:  epoch: 1 
20240219_111839|core.utils.my_writer@198: eta: 0:00:00  epoch: 1 iter: 129/130[99.2%] time: 0.8304 lr: 0.0001039 max_mem: 25740M  total_loss: 232.6 (243.4)  loss_coor_x: 0.1349 (0.2106)  loss_coor_y: 0.1493 (0.352)  loss_coor_z: 0.2137 (0.2825)  loss_mask: 0.02592 (0.03421)  loss_mask_full: 0.03323 (0.05327)  loss_region: 231.7 (241.9)  loss_PM_R: 0.01137 (0.01129)  loss_centroid: 0.3711 (0.3828)  loss_z: 0.02189 (0.104) 
20240219_111839|fvcore.common.checkpoint@124: Saving checkpoint to output/gdrn/ambf_suturing/classAware_ambf_suturing_v0.0.2/model_final.pth
20240219_111843|core.gdrn_modeling.datasets.ambf_suturing@100: load cached dataset dicts from /home/mona/juan/gdrnpp_bop2022/.cache/dataset_dicts_ambf_suturing_test_fbe937aad3f79390e82b4f4b5d9b093d.pkl
20240219_111843|core.gdrn_modeling.datasets.data_loader@190: Serializing 1504 elements to byte tensors and concatenating them all ...
20240219_111843|core.gdrn_modeling.datasets.data_loader@195: Serialized dataset takes 2.43 MiB
20240219_111843|core.gdrn_modeling.engine.gdrn_evaluator@689: Start inference on 1504 images
20240219_111844|ERR|__main__@233: An error has been caught in function '<module>', process 'MainProcess' (461099), thread 'MainThread' (140493061670720):
Traceback (most recent call last):

> File "/home/mona/juan/gdrnpp_bop2022/./core/gdrn_modeling/main_gdrn.py", line 233, in <module>
    main(args)
    │    └ Namespace(config_file='configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_su...
    └ <function main at 0x7fc57ed4d820>

  File "/home/mona/juan/gdrnpp_bop2022/./core/gdrn_modeling/main_gdrn.py", line 199, in main
    Lite(
    └ <class '__main__.Lite'>

  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/pytorch_lightning/lite/lite.py", line 408, in _run_impl
    return run_method(*args, **kwargs)
           │           │       └ {}
           │           └ (Namespace(config_file='configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_s...
           └ functools.partial(<bound method LightningLite._run_with_strategy_setup of <__main__.Lite object at 0x7fc716c83130>>, <bound m...
  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/pytorch_lightning/lite/lite.py", line 413, in _run_with_strategy_setup
    return run_method(*args, **kwargs)
           │           │       └ {}
           │           └ (Namespace(config_file='configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_s...
           └ <bound method Lite.run of <__main__.Lite object at 0x7fc716c83130>>

  File "/home/mona/juan/gdrnpp_bop2022/./core/gdrn_modeling/main_gdrn.py", line 189, in run
    return self.do_test(cfg, model)
           │    │       │    └ _LiteModule(
           │    │       │        (_module): GDRN_DoubleMask(
           │    │       │          (backbone): FeatureListNet(
           │    │       │            (stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
           │    │       └ Config (path: configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py...
           │    └ <function GDRN_Lite.do_test at 0x7fc5865f0dc0>
           └ <__main__.Lite object at 0x7fc716c83130>

  File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/engine.py", line 159, in do_test
    results_i = gdrn_inference_on_dataset(cfg, model, data_loader, evaluator, amp_test=cfg.TEST.AMP_TEST)
                │                         │    │      │            │                   └ Config (path: configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py...
                │                         │    │      │            └ <core.gdrn_modeling.engine.gdrn_evaluator.GDRN_Evaluator object at 0x7fc5dcb72dc0>
                │                         │    │      └ <pytorch_lightning.lite.wrappers._LiteDataLoader object at 0x7fc5dc6c0850>
                │                         │    └ _LiteModule(
                │                         │        (_module): GDRN_DoubleMask(
                │                         │          (backbone): FeatureListNet(
                │                         │            (stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
                │                         └ Config (path: configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py...
                └ <function gdrn_inference_on_dataset at 0x7fc58689b670>

  File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/gdrn_evaluator.py", line 737, in gdrn_inference_on_dataset
    out_dict = model(
               └ _LiteModule(
                   (_module): GDRN_DoubleMask(
                     (backbone): FeatureListNet(
                       (stem_0): Conv2d(3, 128, kernel_size=(4, 4),...

  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
           │             │        └ {'roi_classes': tensor([0], device='cuda:0'), 'roi_cams': tensor([[[350.8070,   0.0000, 320.0000],
           │             │                   [  0.0000, 350.80...
           │             └ (tensor([[[[0.5490, 0.5490, 0.5490,  ..., 0.6941, 0.6941, 0.6902],
           │                         [0.5490, 0.5490, 0.5490,  ..., 0.6824, 0.6824, 0...
           └ <bound method _LiteModule.forward of _LiteModule(
               (_module): GDRN_DoubleMask(
                 (backbone): FeatureListNet(
                   (stem_0...
  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/pytorch_lightning/lite/wrappers.py", line 105, in forward
    output = self.module(*args, **kwargs)
             │    │       │       └ {'roi_classes': tensor([0], device='cuda:0'), 'roi_cams': tensor([[[350.8070,   0.0000, 320.0000],
             │    │       │                  [  0.0000, 350.80...
             │    │       └ (tensor([[[[0.5490, 0.5490, 0.5490,  ..., 0.6941, 0.6941, 0.6902],
             │    │                   [0.5490, 0.5490, 0.5490,  ..., 0.6824, 0.6824, 0...
             │    └ <property object at 0x7fc5f0c624f0>
             └ _LiteModule(
                 (_module): GDRN_DoubleMask(
                   (backbone): FeatureListNet(
                     (stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
           │             │        └ {'roi_classes': tensor([0], device='cuda:0'), 'roi_cams': tensor([[[350.8070,   0.0000, 320.0000],
           │             │                   [  0.0000, 350.80...
           │             └ (tensor([[[[0.5490, 0.5490, 0.5490,  ..., 0.6941, 0.6941, 0.6902],
           │                         [0.5490, 0.5490, 0.5490,  ..., 0.6824, 0.6824, 0...
           └ <bound method GDRN_DoubleMask.forward of GDRN_DoubleMask(
               (backbone): FeatureListNet(
                 (stem_0): Conv2d(3, 128, kernel_s...

  File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/models/GDRN_double_mask.py", line 158, in forward
    pred_rot_, pred_t_ = self.pnp_net(
                         └ GDRN_DoubleMask(
                             (backbone): FeatureListNet(
                               (stem_0): Conv2d(3, 128, kernel_size=(4, 4), stride=(4, 4))
                               (stem_1): ...

  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
           │             │        └ {'region': tensor([[[[0.0157, 0.0155, 0.0153,  ..., 0.0153, 0.0153, 0.0156],
           │             │                    [0.0155, 0.0152, 0.0151,  ..., 0.0150,...
           │             └ (tensor([[[[-8.9684e-03, -8.9233e-03, -8.8213e-03,  ..., -9.3900e-03,
           │                          -9.4768e-03, -9.1552e-03],
           │                         [-8.993...
           └ <bound method ConvPnPNet.forward of ConvPnPNet(
               (act): GELU()
               (dropblock): LinearScheduler(
                 (dropblock): DropBlock2D(...

  File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/models/heads/conv_pnp_net.py", line 178, in forward
    x = self.act(self.fc1(flat_conv_feat))
        │        │        └ tensor([[ 0.7126,  0.0505,  0.0421,  ..., -0.0887, -0.0852, -0.0815]],
        │        │                 device='cuda:0')
        │        └ ConvPnPNet(
        │            (act): GELU()
        │            (dropblock): LinearScheduler(
        │              (dropblock): DropBlock2D()
        │            )
        │            (features): ModuleList(
        │              ...
        └ ConvPnPNet(
            (act): GELU()
            (dropblock): LinearScheduler(
              (dropblock): DropBlock2D()
            )
            (features): ModuleList(
              ...

  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
           │             │        └ {}
           │             └ (tensor([[ 0.7126,  0.0505,  0.0421,  ..., -0.0887, -0.0852, -0.0815]],
           │                      device='cuda:0'),)
           └ <bound method Linear.forward of Linear(in_features=8192, out_features=1024, bias=True)>
  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
           │ │      │      │            └ Linear(in_features=8192, out_features=1024, bias=True)
           │ │      │      └ Linear(in_features=8192, out_features=1024, bias=True)
           │ │      └ tensor([[ 0.7126,  0.0505,  0.0421,  ..., -0.0887, -0.0852, -0.0815]],
           │ │               device='cuda:0')
           │ └ <function linear at 0x7fc6c9282160>
           └ <module 'torch.nn.functional' from '/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/functional.py'>
  File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
           │     │  │   │      │      │       └ Parameter containing:
           │     │  │   │      │      │         tensor([-6.3990e-05, -1.7071e-04,  8.4036e-05,  ...,  4.6594e-04,
           │     │  │   │      │      │                 -1.6388e-04, -2.0591e-05], de...
           │     │  │   │      │      └ Parameter containing:
           │     │  │   │      │        tensor([[ 9.1223e-04, -6.6741e-04, -8.0336e-04,  ..., -1.6803e-03,
           │     │  │   │      │                  2.4945e-03,  2.0192e-03],
           │     │  │   │      │        ...
           │     │  │   │      └ tensor([[ 0.7126,  0.0505,  0.0421,  ..., -0.0887, -0.0852, -0.0815]],
           │     │  │   │               device='cuda:0')
           │     │  │   └ <built-in function linear>
           │     │  └ <module 'torch._C._nn'>
           │     └ <module 'torch._C' from '/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/_C.cpython-39-x86_64-linux-g...
           └ <module 'torch' from '/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/__init__.py'>

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am running the training for only 1 epoch so I can debug it.

The code:

(juan-gdrnpp) mona@ada:~/juan/gdrnpp_bop2022$ gedit /home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/gdrn_evaluator.py

File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/gdrn_evaluator.py", line 737, in gdrn_inference_on_dataset
^ based on this error.

github.com

jabarragann/gdrnpp_bop2022/blob/main/core/gdrn_modeling/engine/gdrn_evaluator.py

# -*- coding: utf-8 -*-
"""inference on dataset; save results; evaluate with bop_toolkit (if gt is
available)"""
import datetime
import itertools
import logging
import os.path as osp
import random
import time
from collections import OrderedDict

import cv2
import mmcv
import numpy as np
import ref
import torch
from torch.cuda.amp import autocast
from transforms3d.quaternions import quat2mat

from detectron2.data import MetadataCatalog, DatasetCatalog

This file has been truncated.

^^ I couldn’t copy the full code here because of a limit error.
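
The failing call in the traceback above is a Linear(in_features=8192, out_features=1024) layer applied to a single-row input. A standalone check of just that operation might look like the following. It is a sketch under assumptions: the function name `run_repro` is hypothetical, the input is random rather than the real features, and whether it triggers the same error depends on the environment.

```python
import importlib.util


def run_repro():
    """Run one float32 Linear layer on the GPU and return the output shape.

    Returns None when torch or a CUDA device is unavailable.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    # Shapes taken from the traceback: Linear(8192 -> 1024), one input row.
    layer = torch.nn.Linear(8192, 1024).cuda()
    x = torch.randn(1, 8192, device="cuda")
    out = layer(x)  # float32 matrix multiply on the GPU
    torch.cuda.synchronize()  # surface asynchronous CUDA errors here
    return tuple(out.shape)


if __name__ == "__main__":
    print(run_repro())
```

If this small script raises the same CUBLAS_STATUS_INVALID_VALUE error, the problem most likely lies in the environment rather than in the model code.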

timmyvg · March 12, 2024, 1:16pm

Thank you, your solution works for me.