cuDNN error: CUDNN_STATUS_EXECUTION_FAILED with PyTorch v1.1

Hello everyone!

I am running code from an online repository (link). I couldn’t copy and paste the whole file, as it is huge, but I can if required. When I run the code, I get the following error:

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I use the following tools:

CUDA Toolkit 10.0
PyTorch v1.1.0
Torchvision v0.3.0
Python v3.7
GCC-8.4
Ubuntu 22.04
Anaconda 23.3.1

Here is the complete error:

python -m torch.distributed.launch --nproc_per_node=1 train_det_seg_OCID.py --log_dir=LOGDIR --config="/home/ozaland/newProj/grasp_det_seg_cnn/grasp_det_seg/config/defaults/det_seg_OCID.ini" --data="/home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp/data_split"
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
14:52:31 Loading configuration from /home/ozaland/newProj/grasp_det_seg_cnn/grasp_det_seg/config/defaults/det_seg_OCID.ini
14:52:31 
[general]
val_interval = 25
log_interval = 10
cudnn_benchmark = no
num_classes = 18
num_stuff = 0
num_things = 18
num_semantic = 32

[body]
body = resnet101
weights = /home/ozaland/newProj/grasp_det_seg_cnn/weights_pretrained/resnet101
normalization_mode = syncbn
activation = leaky_relu
activation_slope = 0.01
gn_groups = 16
body_params = {}
num_frozen = 2
bn_frozen = yes
out_channels = {"mod1": 64, "mod2": 256, "mod3": 512, "mod4": 1024, "mod5": 2048}
out_strides = {"mod1": 4, "mod2": 4, "mod3": 8, "mod4": 16, "mod5": 32}

[fpn]
out_channels = 256
extra_scales = 0
interpolation = nearest
inputs = ["mod2", "mod3", "mod4", "mod5"]
out_strides = (4, 8, 16, 32)

[rpn]
hidden_channels = 256
stride = 1
anchor_ratios = (1., 0.1, 0.4, 0.7, 1.2)
anchor_scale = 2
nms_threshold = 0.7
num_pre_nms_train = 12000
num_post_nms_train = 2000
num_pre_nms_val = 6000
num_post_nms_val = 300
min_size = 16
num_samples = 256
pos_ratio = .5
pos_threshold = .7
neg_threshold = .3
void_threshold = 0.7
fpn_min_level = 0
fpn_levels = 3
sigma = 3.

[roi]
roi_size = (14, 14)
num_samples = 128
pos_ratio = .25
pos_threshold = .5
neg_threshold_hi = .5
neg_threshold_lo = 0.
void_threshold = 0.7
void_is_background = no
nms_threshold = 0.3
score_threshold = 0.05
max_predictions = 100
fpn_min_level = 0
fpn_levels = 4
fpn_canonical_scale = 224
fpn_canonical_level = 2
sigma = 1.
bbx_reg_weights = (10., 10., 5., 5.)

[sem]
fpn_min_level = 0
fpn_levels = 4
pooling_size = (64, 64)
ohem = .25

[optimizer]
lr = 0.03
weight_decay = 0.0001
weight_decay_norm = yes
momentum = 0.9
nesterov = yes
loss_weights = (1., 1., 1., 1.,.75)

[scheduler]
epochs = 800
type = poly
update_mode = batch
params = {"gamma": 0.9}
burn_in_steps = 500
burn_in_start = 0.333

[dataloader]
root_path = /home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp
shortest_size = 480
longest_max_size = 640
train_batch_size = 10
val_batch_size = 1
rgb_mean = (0.485, 0.456, 0.406)
rgb_std = (0.229, 0.224, 0.225)
random_flip = no
random_scale = None
rotate_and_scale = True
num_workers = 6
train_set = training_0
val_set = validation_0
test_set = validation_0


14:52:31 Creating dataloaders for dataset in /home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp/data_split
14:52:31 Creating backbone model resnet101
14:53:19 Starting epoch 1
Traceback (most recent call last):
  File "train_det_seg_OCID.py", line 496, in <module>
    main(parser.parse_args())
  File "train_det_seg_OCID.py", line 467, in main
    global_step=global_step, loss_weights=config["optimizer"].getstruct("loss_weights"))
  File "train_det_seg_OCID.py", line 274, in train
    losses, _, conf = model(**batch, do_loss=True, do_prediction=False)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/models/det_seg.py", line 71, in forward
    x = self.body(img)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/modules/fpn.py", line 136, in forward
    x = self.backbone(x)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/models/resnet.py", line 115, in forward
    outs["mod4"] = self.mod4(outs["mod3"])
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/grasp_det_seg/modules/residual.py", line 92, in forward
    x = self.convs(x) + residual
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/home/ozaland/anaconda3/envs/detgrasp/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
    cmd=process.args)
subprocess.CalledProcessError: Command '['/home/ozaland/anaconda3/envs/detgrasp/bin/python', '-u', 'train_det_seg_OCID.py', '--local_rank=0', '--log_dir=LOGDIR', '--config=/home/ozaland/newProj/grasp_det_seg_cnn/grasp_det_seg/config/defaults/det_seg_OCID.ini', '--data=/home/ozaland/newProj/grasp_det_seg_cnn/DATA/OCID_grasp/data_split']' returned non-zero exit status 1.

conda list returns the following output:


# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   1.4.0                    pypi_0    pypi
blas                      1.0                         mkl  
ca-certificates           2023.05.30           h06a4308_0  
certifi                   2022.12.7        py37h06a4308_0    anaconda
cffi                      1.15.0           py37h7f8727e_0  
cudatoolkit               10.0.130                      0    anaconda
cudnn                     7.6.5                cuda10.0_0  
expat                     2.4.9                h6a678d5_0  
freetype                  2.12.1               h4a9f257_0  
future                    0.18.2                   pypi_0    pypi
git                       2.19.1          pl526h7fee0ce_0    anaconda
gputil                    1.4.0                    pypi_0    pypi
graspdetseg-cnn           0.1.dev22+g1cfeb2f.d20230711          pypi_0    pypi
grpcio                    1.56.0                   pypi_0    pypi
imageio                   2.31.1                   pypi_0    pypi
importlib-metadata        6.7.0                    pypi_0    pypi
inplace-abn               1.1.1.dev7+gd7dd3e1          pypi_0    pypi
intel-openmp              2022.1.0          h9e868ea_3769  
jpeg                      9e                   h5eee18b_1  
lerc                      3.0                  h295c915_0  
libcurl                   7.61.1               heec0ca6_0  
libdeflate                1.17                 h5eee18b_0  
libedit                   3.1.20221030         h5eee18b_0  
libffi                    3.2.1             hf484d3e_1007  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libpng                    1.6.39               h5eee18b_0  
libssh2                   1.8.0                h9cfc8f7_4  
libstdcxx-ng              11.2.0               h1234567_1  
libtiff                   4.5.0                h6a678d5_2  
libwebp-base              1.2.4                h5eee18b_1  
lz4-c                     1.9.4                h6a678d5_0  
markdown                  3.4.3                    pypi_0    pypi
markupsafe                2.1.3                    pypi_0    pypi
mkl                       2020.2                      256  
mkl-service               2.3.0            py37he8ac12f_0  
mkl_fft                   1.3.0            py37h54f3939_0  
mkl_random                1.1.1            py37h0573a6f_0  
msgpack-python            0.5.6                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
networkx                  2.6.3                    pypi_0    pypi
ninja                     1.10.2               h06a4308_5  
ninja-base                1.10.2               hd09550d_5  
numpy                     1.21.6                   pypi_0    pypi
numpy-base                1.19.2           py37hfa32c7d_0  
olefile                   0.46                     py37_0  
opencv-contrib-python     4.8.0.74                 pypi_0    pypi
openssl                   1.0.2u               h7b6447c_0    anaconda
packaging                 23.1                     pypi_0    pypi
perl                      5.34.0               h5eee18b_1  
pillow                    6.1.0            py37h34e0f95_0    anaconda
pip                       22.3.1           py37h06a4308_0    anaconda
protobuf                  3.20.3                   pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
python                    3.7.0                h6e4f718_3  
pytorch                   1.1.0           cuda100py37he554f03_0  
pywavelets                1.3.0                    pypi_0    pypi
readline                  7.0                  h7b6447c_5  
scikit-image              0.19.3                   pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
setuptools                65.6.3           py37h06a4308_0  
shapely                   1.7.0                    pypi_0    pypi
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.33.0               h62c20be_0  
tensorboard               1.14.0                   pypi_0    pypi
tifffile                  2021.11.2                pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
torchvision               0.3.0           cuda100py37h72fc40a_0  
typing-extensions         4.7.1                    pypi_0    pypi
umsgpack                  0.1.0                    pypi_0    pypi
werkzeug                  2.2.3                    pypi_0    pypi
wheel                     0.38.4           py37h06a4308_0  
xz                        5.4.2                h5eee18b_0  
zipp                      3.15.0                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0

nvidia-smi outputs the following:

Mon Jul 24 15:15:33 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0  On |                  Off |
| 41%   42C    P5              24W / 140W |   1267MiB / 16376MiB |     34%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      6292      G   /usr/lib/xorg/Xorg                          450MiB |
|    0   N/A  N/A      6503      G   /usr/bin/gnome-shell                        123MiB |
|    0   N/A  N/A    189952      G   ...4848331,14781432929507757159,262144      364MiB |
|    0   N/A  N/A    946565      G   ...,WinRetrieveSuggestionsOnlyOnDemand       76MiB |
|    0   N/A  N/A   1262567      G   ...sion,SpareRendererForSitePerProcess      166MiB |
+---------------------------------------------------------------------------------------+

I am running the file with the command provided in the repository, i.e. `python -m torch.distributed.launch --nproc_per_node=1 FileName.py --log_dir=LOGDIR --config=CONFIGFILE --data=DATAFOLDER`

I have done the following:

  • Reinstalled the CUDA toolkit multiple times and even purged all the NVIDIA graphics drivers multiple times, but it still didn’t work.
  • Ruled out OOM, which many in the forums suggested as the cause: I am keeping track of the memory usage, and the maximum GPU memory usage is around 14% for now.
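For reference, the usage figure can be derived from the memory field that nvidia-smi prints. This is only a rough sketch: `memory_usage_fraction` is a hypothetical helper name, and the sample string is the "Memory-Usage" field from the nvidia-smi snapshot above, which was taken at a different moment than the training run.

```python
import re


def memory_usage_fraction(field):
    """Return used/total for an nvidia-smi memory field like '1267MiB / 16376MiB'."""
    match = re.search(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB", field)
    if match is None:
        raise ValueError(f"unrecognized memory field: {field!r}")
    used_mib, total_mib = int(match.group(1)), int(match.group(2))
    return used_mib / total_mib


if __name__ == "__main__":
    # Value copied from the nvidia-smi table in this post.
    fraction = memory_usage_fraction("1267MiB / 16376MiB")
    print(f"{fraction:.1%} of GPU memory in use")
```

A low fraction like this only indicates that memory exhaustion is unlikely; it does not explain the cuDNN failure itself.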

If anyone can provide any assistance, it would be great, as I have been stuck on this for a month now, and no solution seems to work.

You are using an old PyTorch release with CUDA 10.0, which does not support your Ampere GPU.
Update to the latest stable or nightly release with any CUDA 11.x or 12.1 runtime and it should work.
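One way to confirm this kind of mismatch is to compare the GPU's compute capability with the architectures the installed PyTorch binary was compiled for. The sketch below is a simplified check, not PyTorch's own logic: `build_supports_device` is a hypothetical helper, it only considers compiled `sm_XX` targets (PTX `compute_XX` entries are ignored), and it assumes a binary built for `sm_XY` runs on devices with the same major version and an equal or higher minor version. `torch.cuda.get_arch_list()` is only available in newer PyTorch releases, so the script falls back when it is missing.

```python
import re


def build_supports_device(capability, arch_list):
    """Simplified check: does any compiled sm_XX target match the device?

    capability: (major, minor) tuple, e.g. (8, 6) for an Ampere card
                such as the RTX A4000.
    arch_list:  strings such as ["sm_70", "sm_75"].
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        match = re.match(r"sm_(\d+)", arch)
        if match is None:
            continue  # PTX ("compute_XX") and unknown entries are skipped
        major, minor = divmod(int(match.group(1)), 10)
        if major == dev_major and minor <= dev_minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a visible CUDA device is needed for the live check.")
    else:
        capability = torch.cuda.get_device_capability(0)
        # get_arch_list does not exist in old releases, hence the getattr.
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        arch_list = get_arch_list() if get_arch_list else []
        print("PyTorch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)
        print("Device capability:", capability)
        print("Compiled architectures:", arch_list or "unknown (old release)")
        if arch_list:
            print("Supported:", build_supports_device(capability, arch_list))
```

Even when this check passes, bundled libraries such as cuDNN have their own hardware requirements, so treat the result as a hint rather than a guarantee.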

Thank you. @ptrblck

I have now installed PyTorch v1.12 with CUDA 11.7.
When I run the `python setup.py build` command, it gives the following error:

 ********************************************************************************
              Requirements should be satisfied by a PEP 517 installer.
              If you are using pip, you can try `pip install --use-pep517`.
              ********************************************************************************

When I use the --use-pep517 flag, it says that torch is not installed, even though torch is installed in the environment. My initial thought was that the --use-pep517 flag forces pip to use an environment other than the Anaconda environment I am using, where torch is installed. But after some digging, that assumption turned out to be wrong.
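An assumption like this can be tested with a small script that reports which interpreter is running and whether torch is importable from it. This is a minimal sketch: `describe_environment` is a hypothetical helper name, and it only describes the interpreter that executes it, so it says nothing about any separate environment a build tool might create.

```python
import importlib.util
import sys


def describe_environment(module_name="torch"):
    """Report the active interpreter and whether `module_name` can be found."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        "module": module_name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_environment("torch").items():
        print(f"{key}: {value}")
```

Running it with the same `python` that is used for the build shows whether that interpreter belongs to the expected conda environment.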

If you have any idea about that, it would be helpful.

I don’t know what exactly raises the error, but I would recommend sticking to the install commands given here. A simple `pip install torch` command will download torch==2.0.1+cu117 with all needed CUDA 11.7 dependencies and will work for your device.