It's very strange, but after I uninstall nvidia-cublas-cu11 the code starts working.
Could you post the log from the pip uninstall command to show exactly which version you removed?
Since your code is now working I would guess your setup had multiple cublas libs installed.
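One way to check the guess above is to list every installed pip distribution with "cublas" in its name. This is only a sketch: the helper name `find_cublas_distributions` is made up for illustration, and it inspects only what pip knows about in the current Python environment, not system-wide or conda-installed libraries.

```python
from importlib import metadata


def find_cublas_distributions(dists=None):
    """Return sorted (name, version) pairs for packages with 'cublas' in the name.

    `dists` is an optional list of (name, version) pairs; when omitted, the
    distributions installed in the current Python environment are scanned.
    """
    if dists is None:
        dists = [(d.metadata["Name"], d.version) for d in metadata.distributions()]
    return sorted(
        (name, version)
        for name, version in dists
        if name and "cublas" in name.lower()
    )


if __name__ == "__main__":
    # Print one line per matching package, e.g. to spot an unexpected extra copy.
    for name, version in find_cublas_distributions():
        print(f"{name}=={version}")
```

If more than one entry shows up, or an entry you did not expect, that would be consistent with several cuBLAS copies competing at load time.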
I made some changes: I did a plain install of torch without specifying a version and then ran python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.8.11 (default, Sep 1 2021, 12:33:46) [GCC 9.3.1 20200408 (Red Hat 9.3.1-2)] (64-bit runtime)
Python platform: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.120
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.2.4
/usr/lib64/libcudnn_adv_infer.so.8.2.4
/usr/lib64/libcudnn_adv_train.so.8.2.4
/usr/lib64/libcudnn_cnn_infer.so.8.2.4
/usr/lib64/libcudnn_cnn_train.so.8.2.4
/usr/lib64/libcudnn_ops_infer.so.8.2.4
/usr/lib64/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchcam==0.3.2
[pip3] torchvision==0.14.1
[conda] Could not collect
My code with affine grid still wasn't working.
Then, after uninstalling nvidia-cublas, the output was: Successfully uninstalled nvidia-cublas-cu11-11.10.3.66
This seems wrong:
PyTorch version: 1.13.1+cu117
CUDA used to build PyTorch: 11.7
...
CUDA runtime version: 11.4.120
since it points to a mismatch between the CUDA version the wheels were built with, which is what the pip install torch
command would install (CUDA 11.7 with all needed dependencies), and what is detected as your CUDA runtime (11.4).
Did you install other Python packages depending on CUDA11.4?
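The comparison described above can be done mechanically on a saved collect_env report. The sketch below is illustrative only: the function names are hypothetical, it relies on the two line labels exactly as they appear in the logs in this thread, and it treats a difference as a hint worth investigating rather than as proof of a broken setup.

```python
import re


def cuda_versions(report):
    """Return (build, runtime) CUDA version strings found in collect_env text.

    A value is None when its line is missing or not numeric
    (for example "Could not collect").
    """
    build = re.search(
        r"^CUDA used to build PyTorch:\s*(\d+(?:\.\d+)+)", report, re.MULTILINE
    )
    runtime = re.search(
        r"^CUDA runtime version:\s*(\d+(?:\.\d+)+)", report, re.MULTILINE
    )
    return (
        build.group(1) if build else None,
        runtime.group(1) if runtime else None,
    )


def major_minor(version):
    """Reduce a version string such as '11.4.120' to the tuple (11, 4)."""
    return tuple(int(part) for part in version.split(".")[:2])


def versions_differ(report):
    """True if build and runtime differ in major.minor, None if either is unknown."""
    build, runtime = cuda_versions(report)
    if build is None or runtime is None:
        return None
    return major_minor(build) != major_minor(runtime)


if __name__ == "__main__":
    sample = (
        "PyTorch version: 1.13.1+cu117\n"
        "CUDA used to build PyTorch: 11.7\n"
        "CUDA runtime version: 11.4.120\n"
    )
    print(cuda_versions(sample), versions_differ(sample))
```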
I don't know for sure which of my packages require CUDA 11.4; I can only assume that torchvision==0.14.0 depends on CUDA 11.4.
Why would you assume torchvision depends on CUDA 11.4? Did you see anything pointing towards this dependency in your install logs or during runtime, or is this pure guessing?
It's a pure guess, and it seems silly now.
I didn't install any other packages depending on CUDA 11.4.
Also, after uninstalling, the output of the environment collection didn't change.
I get the same error:
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.0.76
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.24.0
[pip3] torch==1.13.1
[conda] cudatoolkit 11.6.0 hecad31d_10 conda-forge
[conda] numpy 1.21.0 pypi_0 pypi
[conda] torch 1.12.0 pypi_0 pypi
[conda] torchaudio 0.12.0+cu116 pypi_0 pypi
[conda] torchtext 0.13.0 pypi_0 pypi
[conda] torchvision 0.13.0+cu116 pypi_0 pypi
I had the same issue as in the logs above and, with the pointers above, figured out that the cause was a mismatch between the CUDA runtime and the PyTorch build.
- I think this happened (everything was running smoothly a day earlier) because I updated my Ubuntu packages.
To resolve it:
- I purged the NVIDIA drivers and reinstalled CUDA 11.7 and cuDNN.
- I did not reinstall PyTorch.
I faced the same issue. There was no mismatch in tensor shapes; I had to change the PyTorch CUDA version (11.7) to be compatible with my system's CUDA 11.6 (I ended up downloading PyTorch 1.13 built for CUDA 11.6).
I had the same error, and the root cause for me was also a mismatch between the PyTorch CUDA version and the system CUDA version. I used torch 1.13.1 with CUDA 11.6 and then it worked (my Docker image has CUDA 11.6).
I solved this error, which I hit in torch.matmul. Using python -m torch.utils.collect_env, I found that cuDNN was installed in a different path from CUDA. Torch versions after 1.2 are very strict about the CUDA and cuDNN versions. After I corrected this, the problem was solved.
I was able to resolve this by adding the conda environment's lib directory to my LD_LIBRARY_PATH variable:
export LD_LIBRARY_PATH=/home/$USER/.conda/envs/$ENVNAME/lib:/usr/local/cuda-11/lib64
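Before or after changing LD_LIBRARY_PATH like this, it can help to see which directories on it actually contain cuBLAS or cuDNN libraries. The following is a rough sketch, not a definitive tool: the helper name and the file patterns are assumptions chosen for illustration, and it only scans the directories named in the variable, not the system default library locations.

```python
import glob
import os


def find_libs_on_path(path_value, patterns=("libcublas.so*", "libcudnn.so*")):
    """Map each directory of an LD_LIBRARY_PATH-style string to matching files.

    Directories are visited in the order they appear in `path_value`;
    directories without any match are left out of the result.
    """
    found = {}
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        matches = []
        for pattern in patterns:
            matches.extend(sorted(glob.glob(os.path.join(directory, pattern))))
        if matches:
            found[directory] = [os.path.basename(m) for m in matches]
    return found


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, libs in find_libs_on_path(current).items():
        print(directory, libs)
```

Several directories each holding a different libcublas or libcudnn version would fit the mismatch reports earlier in this thread.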
I get this error. Please let me know if you have any suggestions.
(gdrnpp) mona@ada:~/gdrnpp_bop2022$ ./det/yolox/tools/test_yolox.sh ./configs/yolox/bop_pbr/yolox_x_640_augCozyAAEhsv_ranger_30_epochs_mona_bop_test.py 0 ./output/yolox/bop_pbr/yolox_x_640_augCozyAAEhsv_ranger_30_epochs_mona_bop_test/model_final.pth
_module.pnp_net.features.0.weight
_module.pnp_net.features.1.{bias, weight}
_module.pnp_net.features.3.weight
_module.pnp_net.features.4.{bias, weight}
_module.pnp_net.features.6.weight
_module.pnp_net.features.7.{bias, weight}
_module.pnp_net.fc1.{bias, weight}
_module.pnp_net.fc2.{bias, weight}
_module.pnp_net.fc_r.{bias, weight}
_module.pnp_net.fc_t.{bias, weight}
[0207_134552 detectron2@57]: Fusing conv bn...
ERROR [0207_134553 d2.engine.launch@82]: An error has been caught in function 'launch', process 'MainProcess' (839409), thread 'MainThread' (140334550648640):
Traceback (most recent call last):
File "/home/mona/gdrnpp_bop2022/./det/yolox/tools/main_yolox.py", line 70, in <module>
launch(
-> <function launch at 0x7fa2132cac10>
> File "/home/mona/anaconda3/envs/gdrnpp/lib/python3.9/site-packages/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
| -> (Namespace(config_file='./configs/yolox/bop_pbr/yolox_x_640_augCozyAAEhsv_ranger_30_epochs_mona_bop_test.py', resume=False,...
-> <function main at 0x7fa0a6bc98b0>
File "/home/mona/gdrnpp_bop2022/./det/yolox/tools/main_yolox.py", line 58, in main
model = fuse_model(model)
| -> YOLOX(
| (backbone): YOLOPAFPN(
| (backbone): CSPDarknet(
| (stem): Focus(
| (conv): BaseConv(
| (conv): ...
-> <function fuse_model at 0x7fa113d48310>
File "/home/mona/gdrnpp_bop2022/det/yolox/tools/../../../det/yolox/utils/model_utils.py", line 67, in fuse_model
m.conv = fuse_conv_and_bn(m.conv, m.bn) # update conv
| | | -> BaseConv(
| | | (conv): Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
| | | (bn): BatchNorm2d(80, eps...
| | -> BaseConv(
| | (conv): Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
| | (bn): BatchNorm2d(80, eps...
| -> <function fuse_conv_and_bn at 0x7fa113d48280>
-> BaseConv(
(conv): Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(80, eps...
File "/home/mona/gdrnpp_bop2022/det/yolox/tools/../../../det/yolox/utils/model_utils.py", line 57, in fuse_conv_and_bn
fusedconv.bias.copy_(torch.mm(w_bn, b_conv.reshape(-1, 1)).reshape(-1) + b_bn)
| | | | | | -> tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
| | | | | | 0., 0., 0., 0...
| | | | | -> <method 'reshape' of 'torch._C._TensorBase' objects>
| | | | -> tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
| | | | 0., 0., 0., 0...
| | | -> tensor([[0.9995, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
| | | [0.0000, 0.9995, 0.0000, ..., 0.0000, 0.0000, 0.0000...
| | -> <built-in method mm of type object at 0x7fa1e1695ee0>
| -> <module 'torch' from '/home/mona/anaconda3/envs/gdrnpp/lib/python3.9/site-packages/torch/__init__.py'>
-> Conv2d(12, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
and I have:
(gdrnpp) mona@ada:~/gdrnpp_bop2022$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
GPU models and configuration: GPU 0: NVIDIA RTX 6000 Ada Generation
Nvidia driver version: 535.104.12
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] pytorch-lightning==1.6.0
[pip3] torch==1.10.1
[pip3] torchaudio==0.10.1
[pip3] torchmetrics==1.3.0.post0
[pip3] torchvision==0.11.2
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.3.1 hb98b00a_12 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mypy-extensions 1.0.0 pypi_0 pypi
[conda] numpy 1.26.3 py39h474f0d3_0 conda-forge
[conda] pytorch 1.10.1 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-lightning 1.6.0 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 2.2.0 pypi_0 pypi
[conda] torchaudio 0.10.1 py39_cu113 pytorch
[conda] torchmetrics 1.3.0.post0 pypi_0 pypi
[conda] torchvision 0.11.2 py39_cu113 pytorch
I am using this repo
Could you post a minimal and executable code snippet reproducing the issue, please?
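As an illustration of what such a snippet might look like, independent of the repository: the failing call in the traceback above is a torch.mm that ends in cublasSgemm, so a single float32 matrix multiplication on the GPU is a reasonable first check. This is a sketch under the assumption that the problem is not specific to the model; the function name is hypothetical.

```python
def cublas_smoke_test(n=64):
    """Run one float32 matmul on the GPU and describe the outcome as a string."""
    try:
        import torch
    except (ImportError, OSError) as exc:
        # OSError can occur when a shared library torch depends on fails to load.
        return f"import failed: {exc}"
    if not torch.cuda.is_available():
        return "CUDA not available"
    try:
        a = torch.randn(n, n, device="cuda")
        b = torch.randn(n, n, device="cuda")
        c = torch.mm(a, b)
        torch.cuda.synchronize()  # make asynchronous CUDA errors surface here
    except RuntimeError as exc:
        return f"failed: {exc}"
    return f"ok: result shape {tuple(c.shape)}"


if __name__ == "__main__":
    print(cublas_smoke_test())
```

If this already fails with CUBLAS_STATUS_INVALID_VALUE, the issue is more likely in the environment than in the model code.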
Sorry for the delayed response.
Here's the code and the error again. I am using this repo: jabarragann/gdrnpp_bop2022 at 21d103da8716755f6e3c73a9e127d7efd3852eed, along with the recommended setup in gdrnpp_bop2022/JuanInstallation.md at 21d103da8716755f6e3c73a9e127d7efd3852eed.
(juan-gdrnpp) mona@ada:~/juan/gdrnpp_bop2022$ ./core/gdrn_modeling/train_gdrn.sh configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py 0
20240219_111807|core.utils.my_writer@198: eta: 0:00:32 epoch: 1 iter: 90/130[69.2%] time: 0.8363 lr: 7.2728e-05 max_mem: 25740M total_loss: 249.4 (243.8) loss_coor_x: 0.1796 (0.2404) loss_coor_y: 0.1663 (0.4394) loss_coor_z: 0.2217 (0.309) loss_mask: 0.02429 (0.03835) loss_mask_full: 0.03292 (0.06163) loss_region: 248.3 (242.2) loss_PM_R: 0.01011 (0.01141) loss_centroid: 0.3883 (0.3848) loss_z: 0.118 (0.1263)
epoch_str: epoch: 1
20240219_111808|core.utils.my_writer@198: eta: 0:00:31 epoch: 1 iter: 91/130[70.0%] time: 0.8361 lr: 7.3527e-05 max_mem: 25740M total_loss: 249.4 (243.7) loss_coor_x: 0.1713 (0.2394) loss_coor_y: 0.1659 (0.4363) loss_coor_z: 0.2217 (0.3085) loss_mask: 0.02429 (0.03822) loss_mask_full: 0.03357 (0.06136) loss_region: 248.3 (242.1) loss_PM_R: 0.01011 (0.01141) loss_centroid: 0.3883 (0.3849) loss_z: 0.1173 (0.1261)
epoch_str: epoch: 1
20240219_111809|core.utils.my_writer@198: eta: 0:00:30 epoch: 1 iter: 92/130[70.8%] time: 0.8357 lr: 7.4326e-05 max_mem: 25740M total_loss: 249.4 (243.9) loss_coor_x: 0.1643 (0.2384) loss_coor_y: 0.1628 (0.4332) loss_coor_z: 0.2217 (0.3074) loss_mask: 0.02429 (0.03803) loss_mask_full: 0.03421 (0.06108) loss_region: 248.3 (242.3) loss_PM_R: 0.01011 (0.01141) loss_centroid: 0.3883 (0.3846) loss_z: 0.1166 (0.1259)
epoch_str: epoch: 1
20240219_111810|core.utils.my_writer@198: eta: 0:00:30 epoch: 1 iter: 93/130[71.5%] time: 0.8355 lr: 7.5126e-05 max_mem: 25740M total_loss: 250.1 (244.1) loss_coor_x: 0.1634 (0.2375) loss_coor_y: 0.1598 (0.4302) loss_coor_z: 0.2192 (0.3063) loss_mask: 0.02379 (0.03787) loss_mask_full: 0.03421 (0.06076) loss_region: 249 (242.5) loss_PM_R: 0.01011 (0.0114) loss_centroid: 0.3886 (0.3848) loss_z: 0.1166 (0.1256)
epoch_str: epoch: 1
20240219_111810|core.utils.my_writer@198: eta: 0:00:29 epoch: 1 iter: 94/130[72.3%] time: 0.8351 lr: 7.5925e-05 max_mem: 25740M total_loss: 250.1 (244) loss_coor_x: 0.1626 (0.2366) loss_coor_y: 0.1591 (0.4273) loss_coor_z: 0.2182 (0.3051) loss_mask: 0.02426 (0.03773) loss_mask_full: 0.03421 (0.06046) loss_region: 249 (242.4) loss_PM_R: 0.01008 (0.01138) loss_centroid: 0.3904 (0.3849) loss_z: 0.1165 (0.1254)
epoch_str: epoch: 1
20240219_111811|core.utils.my_writer@198: eta: 0:00:28 epoch: 1 iter: 95/130[73.1%] time: 0.8349 lr: 7.6724e-05 max_mem: 25740M total_loss: 250.1 (243.9) loss_coor_x: 0.1616 (0.2358) loss_coor_y: 0.1575 (0.4244) loss_coor_z: 0.2182 (0.3043) loss_mask: 0.02426 (0.03763) loss_mask_full: 0.03421 (0.06016) loss_region: 249 (242.3) loss_PM_R: 0.01019 (0.01138) loss_centroid: 0.3904 (0.3847) loss_z: 0.1154 (0.1251)
epoch_str: epoch: 1
20240219_111812|core.utils.my_writer@198: eta: 0:00:27 epoch: 1 iter: 96/130[73.8%] time: 0.8348 lr: 7.7523e-05 max_mem: 25740M total_loss: 249.4 (243.6) loss_coor_x: 0.1596 (0.2347) loss_coor_y: 0.1564 (0.4215) loss_coor_z: 0.2182 (0.3028) loss_mask: 0.02357 (0.03747) loss_mask_full: 0.03471 (0.05991) loss_region: 248.3 (242) loss_PM_R: 0.01019 (0.01136) loss_centroid: 0.3886 (0.3847) loss_z: 0.1142 (0.1248)
epoch_str: epoch: 1
20240219_111813|core.utils.my_writer@198: eta: 0:00:26 epoch: 1 iter: 97/130[74.6%] time: 0.8347 lr: 7.8322e-05 max_mem: 25740M total_loss: 249.4 (243.7) loss_coor_x: 0.1565 (0.2338) loss_coor_y: 0.1562 (0.4188) loss_coor_z: 0.216 (0.3019) loss_mask: 0.02357 (0.03731) loss_mask_full: 0.03471 (0.05962) loss_region: 248.3 (242.2) loss_PM_R: 0.01019 (0.01136) loss_centroid: 0.3883 (0.3844) loss_z: 0.1133 (0.1245)
epoch_str: epoch: 1
20240219_111814|core.utils.my_writer@198: eta: 0:00:25 epoch: 1 iter: 98/130[75.4%] time: 0.8345 lr: 7.9122e-05 max_mem: 25740M total_loss: 249.4 (243.6) loss_coor_x: 0.1542 (0.2328) loss_coor_y: 0.1556 (0.4159) loss_coor_z: 0.216 (0.3011) loss_mask: 0.02309 (0.03715) loss_mask_full: 0.03443 (0.05937) loss_region: 248.3 (242.1) loss_PM_R: 0.01019 (0.01135) loss_centroid: 0.3886 (0.3846) loss_z: 0.1108 (0.1242)
epoch_str: epoch: 1
20240219_111815|core.utils.my_writer@198: eta: 0:00:25 epoch: 1 iter: 99/130[76.2%] time: 0.8344 lr: 7.9921e-05 max_mem: 25740M total_loss: 250.1 (243.8) loss_coor_x: 0.153 (0.2318) loss_coor_y: 0.1556 (0.4133) loss_coor_z: 0.2154 (0.3) loss_mask: 0.02309 (0.03698) loss_mask_full: 0.03401 (0.05911) loss_region: 249 (242.2) loss_PM_R: 0.01023 (0.01134) loss_centroid: 0.3886 (0.385) loss_z: 0.1089 (0.1239)
epoch_str: epoch: 1
20240219_111839|core.utils.my_writer@198: eta: 0:00:00 epoch: 1 iter: 129/130[99.2%] time: 0.8304 lr: 0.0001039 max_mem: 25740M total_loss: 232.6 (243.4) loss_coor_x: 0.1349 (0.2106) loss_coor_y: 0.1493 (0.352) loss_coor_z: 0.2137 (0.2825) loss_mask: 0.02592 (0.03421) loss_mask_full: 0.03323 (0.05327) loss_region: 231.7 (241.9) loss_PM_R: 0.01137 (0.01129) loss_centroid: 0.3711 (0.3828) loss_z: 0.02189 (0.104)
20240219_111839|fvcore.common.checkpoint@124: Saving checkpoint to output/gdrn/ambf_suturing/classAware_ambf_suturing_v0.0.2/model_final.pth
20240219_111843|core.gdrn_modeling.datasets.ambf_suturing@100: load cached dataset dicts from /home/mona/juan/gdrnpp_bop2022/.cache/dataset_dicts_ambf_suturing_test_fbe937aad3f79390e82b4f4b5d9b093d.pkl
20240219_111843|core.gdrn_modeling.datasets.data_loader@190: Serializing 1504 elements to byte tensors and concatenating them all ...
20240219_111843|core.gdrn_modeling.datasets.data_loader@195: Serialized dataset takes 2.43 MiB
20240219_111843|core.gdrn_modeling.engine.gdrn_evaluator@689: Start inference on 1504 images
20240219_111844|ERR|__main__@233: An error has been caught in function '<module>', process 'MainProcess' (461099), thread 'MainThread' (140493061670720):
Traceback (most recent call last):
> File "/home/mona/juan/gdrnpp_bop2022/./core/gdrn_modeling/main_gdrn.py", line 233, in <module>
main(args)
│ └ Namespace(config_file='configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_su...
└ <function main at 0x7fc57ed4d820>
File "/home/mona/juan/gdrnpp_bop2022/./core/gdrn_modeling/main_gdrn.py", line 199, in main
Lite(
└ <class '__main__.Lite'>
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/pytorch_lightning/lite/lite.py", line 408, in _run_impl
return run_method(*args, **kwargs)
│ │ └ {}
│ └ (Namespace(config_file='configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_s...
└ functools.partial(<bound method LightningLite._run_with_strategy_setup of <__main__.Lite object at 0x7fc716c83130>>, <bound m...
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/pytorch_lightning/lite/lite.py", line 413, in _run_with_strategy_setup
return run_method(*args, **kwargs)
│ │ └ {}
│ └ (Namespace(config_file='configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_s...
└ <bound method Lite.run of <__main__.Lite object at 0x7fc716c83130>>
File "/home/mona/juan/gdrnpp_bop2022/./core/gdrn_modeling/main_gdrn.py", line 189, in run
return self.do_test(cfg, model)
│ │ │ └ _LiteModule(
│ │ │ (_module): GDRN_DoubleMask(
│ │ │ (backbone): FeatureListNet(
│ │ │ (stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
│ │ └ Config (path: configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py...
│ └ <function GDRN_Lite.do_test at 0x7fc5865f0dc0>
└ <__main__.Lite object at 0x7fc716c83130>
File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/engine.py", line 159, in do_test
results_i = gdrn_inference_on_dataset(cfg, model, data_loader, evaluator, amp_test=cfg.TEST.AMP_TEST)
│ │ │ │ │ └ Config (path: configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py...
│ │ │ │ └ <core.gdrn_modeling.engine.gdrn_evaluator.GDRN_Evaluator object at 0x7fc5dcb72dc0>
│ │ │ └ <pytorch_lightning.lite.wrappers._LiteDataLoader object at 0x7fc5dc6c0850>
│ │ └ _LiteModule(
│ │ (_module): GDRN_DoubleMask(
│ │ (backbone): FeatureListNet(
│ │ (stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
│ └ Config (path: configs/gdrn/ambf_suturing/convnext_a6_AugCosyAAEGray_BG05_mlL1_DMask_amodalClipBox_classAware_ambf_suturing.py...
└ <function gdrn_inference_on_dataset at 0x7fc58689b670>
File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/gdrn_evaluator.py", line 737, in gdrn_inference_on_dataset
out_dict = model(
└ _LiteModule(
(_module): GDRN_DoubleMask(
(backbone): FeatureListNet(
(stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
│ │ └ {'roi_classes': tensor([0], device='cuda:0'), 'roi_cams': tensor([[[350.8070, 0.0000, 320.0000],
│ │ [ 0.0000, 350.80...
│ └ (tensor([[[[0.5490, 0.5490, 0.5490, ..., 0.6941, 0.6941, 0.6902],
│ [0.5490, 0.5490, 0.5490, ..., 0.6824, 0.6824, 0...
└ <bound method _LiteModule.forward of _LiteModule(
(_module): GDRN_DoubleMask(
(backbone): FeatureListNet(
(stem_0...
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/pytorch_lightning/lite/wrappers.py", line 105, in forward
output = self.module(*args, **kwargs)
│ │ │ └ {'roi_classes': tensor([0], device='cuda:0'), 'roi_cams': tensor([[[350.8070, 0.0000, 320.0000],
│ │ │ [ 0.0000, 350.80...
│ │ └ (tensor([[[[0.5490, 0.5490, 0.5490, ..., 0.6941, 0.6941, 0.6902],
│ │ [0.5490, 0.5490, 0.5490, ..., 0.6824, 0.6824, 0...
│ └ <property object at 0x7fc5f0c624f0>
└ _LiteModule(
(_module): GDRN_DoubleMask(
(backbone): FeatureListNet(
(stem_0): Conv2d(3, 128, kernel_size=(4, 4),...
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
│ │ └ {'roi_classes': tensor([0], device='cuda:0'), 'roi_cams': tensor([[[350.8070, 0.0000, 320.0000],
│ │ [ 0.0000, 350.80...
│ └ (tensor([[[[0.5490, 0.5490, 0.5490, ..., 0.6941, 0.6941, 0.6902],
│ [0.5490, 0.5490, 0.5490, ..., 0.6824, 0.6824, 0...
└ <bound method GDRN_DoubleMask.forward of GDRN_DoubleMask(
(backbone): FeatureListNet(
(stem_0): Conv2d(3, 128, kernel_s...
File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/models/GDRN_double_mask.py", line 158, in forward
pred_rot_, pred_t_ = self.pnp_net(
└ GDRN_DoubleMask(
(backbone): FeatureListNet(
(stem_0): Conv2d(3, 128, kernel_size=(4, 4), stride=(4, 4))
(stem_1): ...
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
│ │ └ {'region': tensor([[[[0.0157, 0.0155, 0.0153, ..., 0.0153, 0.0153, 0.0156],
│ │ [0.0155, 0.0152, 0.0151, ..., 0.0150,...
│ └ (tensor([[[[-8.9684e-03, -8.9233e-03, -8.8213e-03, ..., -9.3900e-03,
│ -9.4768e-03, -9.1552e-03],
│ [-8.993...
└ <bound method ConvPnPNet.forward of ConvPnPNet(
(act): GELU()
(dropblock): LinearScheduler(
(dropblock): DropBlock2D(...
File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/models/heads/conv_pnp_net.py", line 178, in forward
x = self.act(self.fc1(flat_conv_feat))
│ │ └ tensor([[ 0.7126, 0.0505, 0.0421, ..., -0.0887, -0.0852, -0.0815]],
│ │ device='cuda:0')
│ └ ConvPnPNet(
│ (act): GELU()
│ (dropblock): LinearScheduler(
│ (dropblock): DropBlock2D()
│ )
│ (features): ModuleList(
│ ...
└ ConvPnPNet(
(act): GELU()
(dropblock): LinearScheduler(
(dropblock): DropBlock2D()
)
(features): ModuleList(
...
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
│ │ └ {}
│ └ (tensor([[ 0.7126, 0.0505, 0.0421, ..., -0.0887, -0.0852, -0.0815]],
│ device='cuda:0'),)
└ <bound method Linear.forward of Linear(in_features=8192, out_features=1024, bias=True)>
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
│ │ │ │ └ Linear(in_features=8192, out_features=1024, bias=True)
│ │ │ └ Linear(in_features=8192, out_features=1024, bias=True)
│ │ └ tensor([[ 0.7126, 0.0505, 0.0421, ..., -0.0887, -0.0852, -0.0815]],
│ │ device='cuda:0')
│ └ <function linear at 0x7fc6c9282160>
└ <module 'torch.nn.functional' from '/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/functional.py'>
File "/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
│ │ │ │ │ │ └ Parameter containing:
│ │ │ │ │ │ tensor([-6.3990e-05, -1.7071e-04, 8.4036e-05, ..., 4.6594e-04,
│ │ │ │ │ │ -1.6388e-04, -2.0591e-05], de...
│ │ │ │ │ └ Parameter containing:
│ │ │ │ │ tensor([[ 9.1223e-04, -6.6741e-04, -8.0336e-04, ..., -1.6803e-03,
│ │ │ │ │ 2.4945e-03, 2.0192e-03],
│ │ │ │ │ ...
│ │ │ │ └ tensor([[ 0.7126, 0.0505, 0.0421, ..., -0.0887, -0.0852, -0.0815]],
│ │ │ │ device='cuda:0')
│ │ │ └ <built-in function linear>
│ │ └ <module 'torch._C._nn'>
│ └ <module 'torch._C' from '/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/_C.cpython-39-x86_64-linux-g...
└ <module 'torch' from '/home/mona/anaconda3/envs/juan-gdrnpp/lib/python3.9/site-packages/torch/__init__.py'>
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I am running the training for only 1 epoch so I can debug it.
The code:
(juan-gdrnpp) mona@ada:~/juan/gdrnpp_bop2022$ gedit /home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/gdrn_evaluator.py
File "/home/mona/juan/gdrnpp_bop2022/core/gdrn_modeling/../../core/gdrn_modeling/engine/gdrn_evaluator.py", line 737, in gdrn_inference_on_dataset
^ I opened this file based on the error above.
^^ I couldn't paste the code because I hit a limit error.
Thank you, your solution works for me.