vertices=torch.matmul(vertices.unsqueeze(0), rotations_init): RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched` on CentOS

How can I fix this?

(phosa) [jalal@goku phosa]$ python demo.py --filename input/dark_bat.jpg --class_name bat
2021-03-26 16:55:48,497 INFO     Calling with args: Namespace(class_name='bat', filename='input/dark_bat.jpg', lw_collision=None, lw_depth=None, lw_inter=None, lw_inter_part=None, lw_scale=None, lw_scale_person=None, lw_sil=None, mesh_index=0, output_dir='output')
2021-03-26 16:55:51,993 INFO     Loading checkpoint from detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 16:55:51,995 INFO     URL https://dl.fbaipublicfiles.com/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl cached in /home/grad3/jalal/.torch/fvcore_cache/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 16:55:52,104 INFO     Reading a file from 'Detectron2 Model Zoo'
WARNING: You are using a SMPL model, with only 10 shape coefficients.
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /home/grad3/jalal/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|████████████████████| 97.8M/97.8M [00:00<00:00, 108MB/s]
class_name:  bat
  0%|                                                                                                  | 0/12500.0 [00:00<?, ?it/s]Traceback (most recent call last):
  File "demo.py", line 145, in <module>
    main(get_args())
  File "demo.py", line 121, in main
    instances=instances, class_name=args.class_name, mesh_index=args.mesh_index
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 406, in find_optimal_poses
    num_initializations=num_initializations,
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 287, in find_optimal_pose
    vertices=torch.matmul(vertices.unsqueeze(0), rotations_init),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0%|                                                                                                  | 0/12500.0 [00:00<?, ?it/s]
(phosa) [jalal@goku ~]$ python
Python 3.6.8 (default, Nov 16 2020, 16:55:22) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.8.1+cu111'
>>> torch.cuda.is_available()
True

and

$ lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.9.2009 (Core)
Release:	7.9.2009
Codename:	Core

and

(phosa) [jalal@goku ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

and

(phosa) [jalal@goku ~]$ python collect_env.py 
Collecting environment information...
PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 7.3.0
Clang version: 3.4.2 (tags/RELEASE_34/dot2-final)
CMake version: version 3.10.0-rc5

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 450.51.06
cuDNN version: Probably one of the following:
/scratch/system/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudnn.so.7.4.2
/scratch/system/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7.0.5
/scratch/system/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.8.0.2
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.2
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.2
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.2
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.2
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.2
/scratch2/system/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.2
/scratch2/system/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.5.1.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] neural-renderer-pytorch==1.1.3
[pip3] numpy==1.19.5
[pip3] torch==1.8.1+cu111
[pip3] torchaudio==0.8.1
[pip3] torchgeometry==0.1.2
[pip3] torchvision==0.9.1+cu111
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.0.130                      0  
[conda] mkl                       2018.0.3                      1  
[conda] mkl-service               1.1.2            py36h17a0993_4  
[conda] mkl_fft                   1.0.10                   py36_0    conda-forge
[conda] mkl_random                1.0.2                    py36_0    conda-forge
[conda] msgpack-numpy             0.4.4.3                  pypi_0    pypi
[conda] numpy                     1.19.4                   pypi_0    pypi
[conda] numpydoc                  0.9.1                      py_0    conda-forge
[conda] pytorch                   1.1.0           py3.6_cuda10.0.130_cudnn7.5.1_0    pytorch
[conda] pytorch-nightly           1.0.0.dev20190328 py3.6_cuda10.0.130_cudnn7.4.2_0    pytorch
[conda] pytorchviz                0.0.1                    pypi_0    pypi
[conda] torchvision               0.3.0           py36_cu10.0.130_1    pytorch

Please let me know if more details are required.

Your NVIDIA driver (450.51.06) might be too old for the CUDA 11.1 runtime.
Table 1 in this doc gives:

CUDA Toolkit	Linux x86_64 Driver Version
CUDA 11.2	>= 450.80.02
CUDA 11.1 (11.1.0)	>= 450.80.02
CUDA 11.0 (11.0.3)	>= 450.36.06

You could either install the CUDA 10.2 binaries or update the driver.
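The comparison against the table can be sketched in a few lines of Python. This is only an illustration: the `MIN_DRIVER` dictionary and both function names are made up for this example, and the minimum versions are copied from the table quoted above rather than queried from any NVIDIA tool.

```python
# Minimum Linux x86_64 driver per CUDA toolkit, copied from the table quoted above.
MIN_DRIVER = {
    "11.2": "450.80.02",
    "11.1": "450.80.02",
    "11.0": "450.36.06",
}


def parse_version(text):
    """Turn '450.51.06' into (450, 51, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(driver_version, cuda_version):
    """Return True if the driver meets the minimum listed for this CUDA toolkit."""
    minimum = MIN_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)


print(driver_is_new_enough("450.51.06", "11.1"))  # driver from the original post -> False
print(driver_is_new_enough("460.67", "11.1"))     # a newer 460.x driver -> True
```

The driver version itself is shown in the header of `nvidia-smi`, and `torch.version.cuda` reports the CUDA version the installed PyTorch binary was built with.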

Thanks a lot, @ptrblck. I updated my driver, but I still get exactly the same error after rebooting the system:

(phosa) [jalal@goku phosa]$ python demo.py --filename input/dark_bat.jpg --class_name bat
2021-03-26 21:23:43,700 INFO     Calling with args: Namespace(class_name='bat', filename='input/dark_bat.jpg', lw_collision=None, lw_depth=None, lw_inter=None, lw_inter_part=None, lw_scale=None, lw_scale_person=None, lw_sil=None, mesh_index=0, output_dir='output')
2021-03-26 21:24:03,129 INFO     Loading checkpoint from detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 21:24:03,183 INFO     URL https://dl.fbaipublicfiles.com/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl cached in /home/grad3/jalal/.torch/fvcore_cache/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 21:24:05,310 INFO     Reading a file from 'Detectron2 Model Zoo'
WARNING: You are using a SMPL model, with only 10 shape coefficients.
class_name:  bat
  0%|                                                                                         | 0/800.0 [00:00<?, ?it/s]Traceback (most recent call last):
  File "demo.py", line 145, in <module>
    main(get_args())
  File "demo.py", line 121, in main
    instances=instances, class_name=args.class_name, mesh_index=args.mesh_index
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 406, in find_optimal_poses
    num_initializations=num_initializations,
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 287, in find_optimal_pose
    vertices=torch.matmul(vertices.unsqueeze(0), rotations_init),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0%|                                                                                         | 0/800.0 [00:00<?, ?it/s]

I installed the new driver using this file: NVIDIA-Linux-x86_64-460.67.run

Could you please let me know exactly which driver file or command I should use?

Could you lower the batch size and see if you might be hitting an out-of-memory issue?
cuBLAS might try to create the handle (and workspace) internally and might be running out of memory.
If that's not the case, could you post the model definition as well as the input shapes, so that we could take a look at it?

I reduced the batch_size from 128 to 8 and got the same error:

(phosa) [jalal@goku phosa]$ python demo.py --filename input/dark_bat.jpg --class_name bat
2021-03-26 21:30:00,015 INFO     Calling with args: Namespace(class_name='bat', filename='input/dark_bat.jpg', lw_collision=None, lw_depth=None, lw_inter=None, lw_inter_part=None, lw_scale=None, lw_scale_person=None, lw_sil=None, mesh_index=0, output_dir='output')
2021-03-26 21:30:02,784 INFO     Loading checkpoint from detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 21:30:02,790 INFO     URL https://dl.fbaipublicfiles.com/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl cached in /home/grad3/jalal/.torch/fvcore_cache/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-26 21:30:02,944 INFO     Reading a file from 'Detectron2 Model Zoo'
WARNING: You are using a SMPL model, with only 10 shape coefficients.
class_name:  bat
  0%|                                                                                       | 0/12500.0 [00:00<?, ?it/s]Traceback (most recent call last):
  File "demo.py", line 145, in <module>
    main(get_args())
  File "demo.py", line 121, in main
    instances=instances, class_name=args.class_name, mesh_index=args.mesh_index
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 406, in find_optimal_poses
    num_initializations=num_initializations,
  File "/scratch3/research/code/phosa/phosa/pose_optimization.py", line 287, in find_optimal_pose
    vertices=torch.matmul(vertices.unsqueeze(0), rotations_init),
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0%|                                                                                       | 0/12500.0 [00:00<?, ?it/s]

@ptrblck I am not sure how to output the following two, since this is off-the-shelf code. How should I do that?
model definition as well as input shapes

I printed the model, but execution doesn't get to that point, so I am not sure how to respond.

I assume you are using the model from another repository then?
If so, could you post the link here and how you are initializing the model as well as the input shapes, which cause this error?
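For off-the-shelf code, one way to collect the input shapes is to print them just before the failing call. The helper below is a minimal sketch: `describe` is a hypothetical name, and `FakeTensor` is only a stand-in so the snippet runs without a GPU; its shape is made up and is not taken from phosa.

```python
def describe(name, tensor):
    """Format the shape, dtype and device of a tensor-like object on one line."""
    shape = tuple(getattr(tensor, "shape", ()))
    dtype = getattr(tensor, "dtype", "unknown")
    device = getattr(tensor, "device", "unknown")
    return f"{name}: shape={shape}, dtype={dtype}, device={device}"


class FakeTensor:
    """Stand-in with made-up values so the sketch runs without PyTorch."""
    shape = (1, 100, 3)
    dtype = "float32"
    device = "cuda:0"


example = describe("vertices", FakeTensor())
print(example)
```

In `pose_optimization.py`, calling `print(describe("vertices", vertices))` and `print(describe("rotations_init", rotations_init))` right before the `torch.matmul` call in the traceback would give the shapes asked for. Because CUDA kernels run asynchronously, the Python line in a traceback is not always the operation that failed; setting the environment variable `CUDA_LAUNCH_BLOCKING=1` makes the error surface at the call that caused it.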

Of course. Here is the repo: GitHub - facebookresearch/phosa: Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild
I should note that I got this working on my Ubuntu 20.04 machine with no problem, but its GPU has only 4 GB of memory, so I moved to my CentOS machine, which has GPUs with 12 GB of memory.
The installed packages on my Ubuntu machine:

(phosa) mona@goku:~$ pip list
Package                 Version            Location                                                                                       
----------------------- ------------------ -----------------------------------------------------------------------------------------------
absl-py                 0.11.0             
argon2-cffi             20.1.0             
async-generator         1.10               
attrs                   20.3.0             
backcall                0.2.0              
bleach                  3.3.0              
cachetools              4.2.0              
certifi                 2020.12.5          
cffi                    1.14.5             
chardet                 4.0.0              
chumpy                  0.70               
click                   7.1.2              
cloudpickle             1.6.0              
cycler                  0.10.0             
Cython                  0.29.21            
decorator               4.4.2              
defusedxml              0.7.1              
detectron2              0.4+cu111          
easydict                1.9                
entrypoints             0.3                
faster-rcnn             0.1                /home/mona/research/code/phosa/external/frankmocap/detectors/hand_object_detector/lib
filelock                3.0.12             
Flask                   1.1.2              
future                  0.18.2             
fvcore                  0.1.3.post20210311 
gdown                   3.12.2             
google-auth             1.24.0             
google-auth-oauthlib    0.4.2              
grpcio                  1.34.1             
idna                    2.10               
imageio                 2.9.0              
iopath                  0.1.2              
ipykernel               5.5.0              
ipython                 7.21.0             
ipython-genutils        0.2.0              
ipywidgets              7.6.3              
itsdangerous            1.1.0              
jedi                    0.18.0             
Jinja2                  2.11.3             
joblib                  1.0.0              
jsonschema              3.2.0              
jupyter                 1.0.0              
jupyter-client          6.1.12             
jupyter-console         6.3.0              
jupyter-core            4.7.1              
jupyterlab-pygments     0.1.2              
jupyterlab-widgets      1.0.0              
kiwisolver              1.3.1              
Markdown                3.3.3              
MarkupSafe              1.1.1              
matplotlib              3.3.3              
mistune                 0.8.4              
mock                    4.0.3              
nbclient                0.5.3              
nbconvert               6.0.7              
nbformat                5.1.2              
nest-asyncio            1.5.1              
networkx                2.5                
neural-renderer-pytorch 1.1.3              
notebook                6.3.0              
numpy                   1.19.5             
oauthlib                3.1.0              
omegaconf               2.0.6              
opencv-python           4.5.1.48           
packaging               20.9               
pandocfilters           1.4.3              
parso                   0.8.1              
pexpect                 4.8.0              
pickleshare             0.7.5              
Pillow                  8.1.0              
pip                     20.0.2             
pkg-resources           0.0.0              
portalocker             2.0.0              
prometheus-client       0.9.0              
prompt-toolkit          3.0.18             
protobuf                3.14.0             
ptyprocess              0.7.0              
pyasn1                  0.4.8              
pyasn1-modules          0.2.8              
pycocotools             2.0.2              
pycparser               2.20               
pydot                   1.4.1              
Pygments                2.8.1              
PyOpenGL                3.1.5              
pyparsing               2.4.7              
pyrsistent              0.17.3             
PySocks                 1.7.1              
python-dateutil         2.8.1              
PyWavelets              1.1.1              
PyYAML                  5.3.1              
pyzmq                   22.0.3             
qtconsole               5.0.3              
QtPy                    1.9.0              
requests                2.25.1             
requests-oauthlib       1.3.0              
rsa                     4.7                
scikit-image            0.18.1             
scikit-learn            0.24.0             
scipy                   1.6.0              
Send2Trash              1.5.0              
setuptools              44.0.0             
six                     1.15.0             
sklearn                 0.0                
smplx                   0.1.26             
tabulate                0.8.7              
tensorboard             2.4.1              
tensorboard-plugin-wit  1.7.0              
termcolor               1.1.0              
terminado               0.9.3              
testpath                0.4.4              
threadpoolctl           2.1.0              
tifffile                2021.3.5           
torch                   1.8.1+cu111        
torchaudio              0.8.1              
torchgeometry           0.1.2              
torchvision             0.9.1+cu111        
tornado                 6.1                
tqdm                    4.56.0             
traitlets               5.0.5              
typing-extensions       3.7.4.3            
urllib3                 1.26.2             
wcwidth                 0.2.5              
webencodings            0.5.1              
Werkzeug                1.0.1              
wheel                   0.36.2             
widgetsnbextension      3.5.1              
yacs                    0.1.8              
(phosa) mona@goku:~$ nvidia-smi
Mon Mar 29 15:16:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P8     5W /  N/A |   2165MiB /  3911MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1251      G   /usr/lib/xorg/Xorg                133MiB |
|    0   N/A  N/A      1848      G   /usr/lib/xorg/Xorg               1662MiB |
|    0   N/A  N/A      2087      G   /usr/bin/gnome-shell              187MiB |
|    0   N/A  N/A    163976      G   ...yb2R1Y3QiOiJjb20ubWljcm9z      163MiB |
+-----------------------------------------------------------------------------+

(phosa) mona@goku:~$ lsb_release -a
LSB Version:	core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.2 LTS
Release:	20.04
Codename:	focal
(phosa) mona@goku:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
(phosa) mona@goku:~$ python
Python 3.8.5 (default, Jan 27 2021, 15:41:15) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Does NVIDIA driver version 460.67 work with CUDA 11.2, Python 3.6.8, and PyTorch 1.8.1+cu111 on CentOS 7?

For me, NVIDIA driver 460.39 works with CUDA 11.2, PyTorch 1.8.1+cu111, and Python 3.8.5 on Ubuntu 20.04.

Both machines run exactly the same code.

It should work as long as your driver is new enough for the CUDA toolkit.
Based on your description, I guess you are not able to run any workload on the CentOS 7 machine. Or are other models (using cuBLAS and cuDNN) working fine?
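A small standalone workload that goes through a broadcast batched matmul, similar in form to the failing line, might look like the sketch below. The shapes (100 points, 8 rotation matrices) and the function name are assumptions for illustration and are not taken from phosa. The `try`/`except` around the import is only there so the snippet still runs on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def batched_matmul_check(num_vertices=100, num_rotations=8):
    """Run a small broadcast matmul; return the result shape, or None without PyTorch."""
    if torch is None:
        return None
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Hypothetical shapes: a (V, 3) point set and a batch of (N, 3, 3) matrices.
    vertices = torch.rand(num_vertices, 3, device=device)
    rotations = torch.eye(3, device=device).repeat(num_rotations, 1, 1)
    # (1, V, 3) @ (N, 3, 3) broadcasts over the batch dimension to (N, V, 3).
    out = torch.matmul(vertices.unsqueeze(0), rotations)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU so asynchronous errors surface here
    return tuple(out.shape)


print(batched_matmul_check())
```

If this small case already fails on the GPU, the problem is more likely in the driver/CUDA/PyTorch combination than in the repository code. If it passes, the real shapes from the failing call are the next thing to try.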

Do you think the Python version is causing the problem here, since the NVIDIA driver, PyTorch, and CUDA versions are the same across all three machines?

Ubuntu 20.04 working
Ubuntu 18.04 not working
CentOS 7 not working
details: File "pose_optimization.py", line 287, in find_optimal_pose vertices=torch.matmul(vertices.unsqueeze(0), rotations_init), RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)` · Issue #29 · facebookresearch/phosa · GitHub

What workload should I be using? Do you have a link to a workload you would suggest I run that might trigger a similar problem?

Someone who ran the code from "Matrix inversion fails on GPU (google Colab)" had exactly the same problem as me, but I ran it here and there was no problem:

import torch
dim = 100
# CPU inversion
A = torch.rand(dim,dim,device='cpu')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

# GPU inversion
A = A.to('cuda')
Ainv = A.inverse()
print(torch.matmul(A,Ainv))

The result was:

tensor([[ 1.0000e+00,  6.5939e-06,  1.2953e-06,  ...,  5.2452e-06,
         -5.4836e-06,  1.6689e-06],
        [-2.6617e-06,  1.0000e+00,  1.5359e-06,  ...,  7.6294e-06,
          2.3842e-07,  2.0266e-06],
        [ 6.7785e-06,  6.1743e-06,  1.0000e+00,  ...,  7.1526e-06,
         -1.5497e-06, -4.7684e-07],
        ...,
        [-3.3288e-06,  1.0316e-06,  5.7282e-07,  ...,  1.0000e+00,
          1.1921e-06, -2.3842e-07],
        [ 7.4506e-07,  1.9073e-06,  7.1526e-07,  ...,  3.3379e-06,
          1.0000e+00,  2.8610e-06],
        [ 9.2387e-07, -1.0252e-05, -8.6427e-07,  ..., -4.5300e-06,
          7.0781e-06,  1.0000e+00]])
tensor([[ 1.0000e+00,  2.3842e-07,  5.9605e-08,  ...,  2.3842e-07,
          0.0000e+00,  1.1921e-07],
        [ 5.9605e-08,  1.0000e+00, -3.5763e-07,  ...,  2.3842e-07,
         -5.9605e-08, -2.3842e-07],
        [-2.8312e-07,  2.7418e-06,  1.0000e+00,  ...,  1.7881e-07,
          2.9802e-07,  1.7881e-07],
        ...,
        [ 2.6822e-07,  2.3842e-07,  2.3842e-07,  ...,  1.0000e+00,
          4.4703e-07,  6.5565e-07],
        [ 5.9605e-08,  2.3842e-07,  3.5763e-07,  ...,  0.0000e+00,
          1.0000e+00, -2.3842e-07],
        [ 1.1921e-07,  1.7881e-06, -1.1921e-07,  ..., -5.9605e-07,
          4.4703e-07,  1.0000e+00]], device='cuda:0')

On CentOS 7, I installed Python 3.8.5 and the PyTorch 1.8.1 build that works with CUDA 10.2, and the problem is resolved. The run below uses batch size 128.

(phosa) [jalal@goku phosa]$ python demo.py --filename input/bat_sidehold.jpg --class_name bat
2021-03-31 03:41:53,935 INFO     Calling with args: Namespace(class_name='bat', filename='input/bat_sidehold.jpg', lw_collision=None, lw_depth=None, lw_inter=None, lw_inter_part=None, lw_scale=None, lw_scale_person=None, lw_sil=None, mesh_index=0, output_dir='output')
2021-03-31 03:41:56,659 INFO     Loading checkpoint from detectron2://PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-31 03:41:56,666 INFO     URL https://dl.fbaipublicfiles.com/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl cached in /home/grad3/jalal/.torch/fvcore_cache/detectron2/PointRend/InstanceSegmentation/pointrend_rcnn_R_50_FPN_3x_coco/164955410/model_final_3c3198.pkl
2021-03-31 03:41:56,806 INFO     Reading a file from 'Detectron2 Model Zoo'
WARNING: You are using a SMPL model, with only 10 shape coefficients.
class_name:  bat
  0%|                                                                                                    | 0/800.0 [00:00<?, ?it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 2.12e+03:   6%|█                   | 50/800.0 [00:07<01:51,  6.74it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.94e+03:  12%|██                  | 100/800.0 [00:14<01:36,  7.25it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.94e+03:  19%|███                 | 150/800.0 [00:23<01:53,  5.73it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.94e+03:  25%|█████               | 200/800.0 [00:31<01:27,  6.88it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.94e+03:  31%|██████              | 250/800.0 [00:38<01:18,  6.97it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.94e+03:  38%|███████             | 300/800.0 [00:45<01:11,  6.99it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  44%|████████            | 350/800.0 [00:52<01:02,  7.23it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  50%|██████████          | 400/800.0 [01:00<00:55,  7.18it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  56%|███████████         | 450/800.0 [01:07<00:48,  7.18it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  62%|████████████        | 500/800.0 [01:14<00:40,  7.36it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  69%|█████████████       | 550/800.0 [01:21<00:34,  7.27it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  75%|███████████████     | 600/800.0 [01:28<00:30,  6.64it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  81%|████████████████    | 650/800.0 [01:36<00:22,  6.81it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.89e+03:  88%|█████████████████   | 700/800.0 [01:44<00:15,  6.28it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.84e+03:  94%|██████████████████  | 750/800.0 [01:51<00:06,  7.16it/s]PoseOptimizer(
  (pool): MaxPool2d(kernel_size=7, stride=1, padding=3, dilation=1, ceil_mode=False)
  (renderer): Renderer()
)
loss: 1.84e+03: 100%|████████████████████| 800/800.0 [01:58<00:00,  6.73it/s]
Loss 8.3427: 100%|████████████████████| 400/400 [02:04<00:00,  3.21it/s]
2021-03-31 03:46:01,843 INFO     Saved rendered image to output/bat_sidehold.jpg.
2021-03-31 03:46:01,854 INFO     Saved top-down image to output/bat_sidehold_top.jpg.
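
When switching between PyTorch builds like this, it can help to check which CUDA version a given version string refers to. The sketch below is a hypothetical helper (the function name is made up) and it assumes that in a `+cuXYZ` suffix the last digit is the minor version and the preceding digits are the major version.

```python
import re


def cuda_tag_to_version(torch_version):
    """Map a version such as '1.8.1+cu111' to '11.1'; return None if there is no +cu tag."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


for version in ("1.8.1+cu111", "1.8.1+cu102", "1.8.1"):
    print(version, "->", cuda_tag_to_version(version))
```

A version string without a `+cu` suffix says nothing about the CUDA build, so the helper returns `None` in that case. Inside Python, `torch.version.cuda` gives the CUDA version the installed binary was built with and is the more reliable check.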