I wrote the test script below to debug a problem I am having with torch.matmul. Note that I have to use these specific versions of PyTorch and torchvision in order to reproduce the results of the HybridPose framework.
test.py is:
import torch

# Two small random matrices allocated directly on the GPU
a = torch.rand(2, 3, device='cuda')
b = torch.rand(3, 2, device='cuda')
try:
    c = torch.matmul(a, b)
except RuntimeError as e:
    print(e)
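A slightly more defensive variant of the same test, as a sketch only. CUDA calls are asynchronous, so an error can surface later than the line that caused it; calling `torch.cuda.synchronize()` right after the matmul makes the script wait for the GPU at that point. The function name `run_matmul_check` and the status strings are made up for illustration, and this does not help if the call itself never returns.

```python
def run_matmul_check(device="cuda"):
    """Run a tiny matmul on `device` and return a status string instead of raising."""
    try:
        import torch
    except ImportError:
        return "torch not installed"

    if device == "cuda" and not torch.cuda.is_available():
        return "cuda not available"

    try:
        a = torch.rand(2, 3, device=device)
        b = torch.rand(3, 2, device=device)
        c = torch.matmul(a, b)
        if device == "cuda":
            # Wait for the GPU so that any asynchronous error is raised here
            torch.cuda.synchronize()
        return "ok: result shape {}".format(tuple(c.shape))
    except RuntimeError as e:
        return "RuntimeError: {}".format(e)


if __name__ == "__main__":
    # The CPU run is a sanity check that the install itself works
    print(run_matmul_check("cpu"))
    print(run_matmul_check("cuda"))
```

Running the CPU case first separates a broken install from a GPU-specific problem.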
When I run it, the script just hangs and never finishes.
Here’s the environment setup:
$ conda list
# packages in environment at /home/mona/anaconda3/envs/hp:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
_pytorch_select 0.2 gpu_0
blas 1.0 mkl
ca-certificates 2023.08.22 h06a4308_0
certifi 2022.12.7 py37h06a4308_0
cffi 1.15.0 py37h7f8727e_0
cudatoolkit 10.0.130 0
cudnn 7.6.5 cuda10.0_0
freetype 2.12.1 h4a9f257_0
giflib 5.2.1 h5eee18b_3
intel-openmp 2022.1.0 h9e868ea_3769
joblib 1.3.2 pypi_0 pypi
jpeg 9e h5eee18b_1
lcms2 2.12 h3be6417_0
lerc 3.0 h295c915_0
libdeflate 1.17 h5eee18b_1
libedit 3.1.20221030 h5eee18b_0
libffi 3.2.1 hf484d3e_1007
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libpng 1.6.39 h5eee18b_0
libstdcxx-ng 11.2.0 h1234567_1
libtiff 4.5.1 h6a678d5_0
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
lz4-c 1.9.4 h6a678d5_0
mkl 2020.2 256
mkl-service 2.3.0 py37he8ac12f_0
mkl_fft 1.3.0 py37h54f3939_0
mkl_random 1.1.1 py37h0573a6f_0
ncurses 6.4 h6a678d5_0
ninja 1.10.2 h06a4308_5
ninja-base 1.10.2 hd09550d_5
numpy 1.19.2 py37h54aff64_0
numpy-base 1.19.2 py37hfa32c7d_0
opencv-python 4.8.1.78 pypi_0 pypi
openssl 1.1.1w h7f8727e_0
pillow 6.2.2 pypi_0 pypi
pip 22.3.1 py37h06a4308_0
pycparser 2.21 pyhd3eb1b0_0
python 3.7.4 h265db76_1
pytorch 1.2.0 cuda100py37h938c94c_0
readline 7.0 h7b6447c_5
scikit-learn 0.21.3 pypi_0 pypi
scipy 1.7.3 pypi_0 pypi
setuptools 65.5.1 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_1
sqlite 3.33.0 h62c20be_0
tk 8.6.12 h1ccaba5_0
torchvision 0.4.0 cuda100py37hecfc37a_0
wheel 0.38.4 py37h06a4308_0
xz 5.4.2 h5eee18b_0
zlib 1.2.13 h5eee18b_0
zstd 1.5.5 hc292b87_0
The versions reported from the Python interpreter:
(hp) mona@mona-ThinkStation-P7:~$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.2.0'
>>> import torchvision
>>> torchvision.__version__
'0.4.0a0'
>>>
The CUDA version PyTorch was built with:
(hp) mona@mona-ThinkStation-P7:~$ python -c "import torch; print(torch.version.cuda)"
10.0.130
The kernel and OS release:
(hp) mona@mona-ThinkStation-P7:~$ uname -a
Linux mona-ThinkStation-P7 6.2.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
(hp) mona@mona-ThinkStation-P7:~$ lsb_release -a
LSB Version: core-11.1.0ubuntu4-noarch:security-11.1.0ubuntu4-noarch
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
The GPU and driver:
(hp) mona@mona-ThinkStation-P7:~$ nvidia-smi
Tue Oct 24 08:35:58 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... On | 00000000:52:00.0 On | Off |
| 30% 59C P2 73W / 300W | 4008MiB / 49140MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2417 G /usr/lib/xorg/Xorg 452MiB |
| 0 N/A N/A 2597 G /usr/bin/gnome-shell 68MiB |
| 0 N/A N/A 3098 G ...AAAAAAAACAAAAAAAAAA= --shared-files 57MiB |
| 0 N/A N/A 3447 G ...irefox/3252/usr/lib/firefox/firefox 357MiB |
| 0 N/A N/A 8414 C python 608MiB |
| 0 N/A N/A 8704 C python 654MiB |
| 0 N/A N/A 8973 C python 692MiB |
| 0 N/A N/A 9484 G ...sion,SpareRendererForSitePerProcess 111MiB |
| 0 N/A N/A 12323 C python 890MiB |
+---------------------------------------------------------------------------------------+
The system CUDA compiler:
(hp) mona@mona-ThinkStation-P7:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
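The outputs above report three different CUDA versions: 10.0.130 from PyTorch, 12.2 from the driver line in nvidia-smi, and 11.8 from nvcc. As a rough sketch, the PyTorch side of this can be collected in one place, together with the GPU's compute capability, which is not shown anywhere above and may matter because the card is much newer than CUDA 10.0. The function name `cuda_summary` and the dictionary keys are hypothetical, chosen only for this example.

```python
def cuda_summary():
    """Collect what PyTorch itself reports about CUDA and the visible GPUs."""
    info = {
        "torch": None,
        "built_with_cuda": None,
        "cuda_available": False,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        return info  # torch missing: leave the defaults in place

    info["torch"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()

    if info["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            info["devices"].append({
                "index": i,
                "name": torch.cuda.get_device_name(i),
                "capability": "{}.{}".format(major, minor),
            })
    return info


if __name__ == "__main__":
    print(cuda_summary())
```

The reported capability can then be compared against the architectures that the installed CUDA toolkit version supports.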
Here is the original error, which I got when I started training from the pretrained weights for the ape class of the LINEMOD dataset in the HybridPose framework:
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ LD_LIBRARY_PATH=lib/regressor:$LD_LIBRARY_PATH python src/train_core.py --load_dir /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199 --object_name ape
number of model parameters: 12959563
loading checkpoint from /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199
Successfully loaded model from /home/mona/HP/HybridPose/saved_weights/linemod/ape/checkpoints/0.001/199
/home/mona/anaconda3/envs/hp/lib/python3.7/site-packages/torch/nn/functional.py:1350: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
File "src/train_core.py", line 114, in <module>
trainer.generate_data(val_loader)
File "./trainers/coretrainer.py", line 572, in generate_data
pts2d_pred_loc, pts2d_pred_var = self.vote_keypoints(pts2d_map_pred, mask_pred)
File "./trainers/coretrainer.py", line 324, in vote_keypoints
mean, var = estimate_voting_distribution_with_mean(mask, pts2d_map, mean)
File "/home/mona/HP/HybridPose/lib/ransac_voting_gpu_layer/ransac_voting_gpu.py", line 400, in estimate_voting_distribution_with_mean
cov=torch.matmul(diff_pts.transpose(2,3), weighted_diff_pts) # b,vn,2,2
RuntimeError: cublas runtime error : the GPU program failed to execute at /tmp/pip-req-build-58y_cjjl/aten/src/THC/THCBlas.cu:331
The repository is on GitHub: chensong1995/HybridPose (HybridPose: 6D Object Pose Estimation under Hybrid Representation, CVPR 2020).
Please note that requirements.txt in this repo pins these exact versions of pytorch, torchvision, and cudatoolkit:
(hp) mona@mona-ThinkStation-P7:~/HP/HybridPose$ cat requirements.txt
pillow>=6.2.2
pytorch==1.2.0
torchvision==0.4.0
cudatoolkit==10.0.130
opencv==3.4.7
setuptools==65.5.1
scikit-learn==0.21.3