CUDNN_STATUS_MAPPING_ERROR using conv2d

pelias · November 12, 2021, 8:17am

I would be immensely grateful for some help. My new GPUs are not working in my conda environment.

In [2]: torch.version.cuda
Out[2]: '10.1.243'

cudatoolkit 10.1.243 h6bb024c_0 anaconda
cudnn 7.6.5.32 hc0a50b0_1 conda-forge
python 3.7.5 h0371630_0 anaconda
python-dateutil 2.8.0 py37_0 anaconda
python_abi 3.7 1_cp37m conda-forge
pytorch 1.3.1 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
torchsummary 1.5.1 pypi_0 pypi
torchvision 0.4.2 py37_cu101 pytorch

I tried updating the PyTorch install within my conda environment using

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

as well as updating cuDNN, but thus far no luck.

GPUs:
NVIDIA GeForce RTX 3090, 00000000:1A:00.0, 94.02.42.00.B0
NVIDIA GeForce RTX 3090, 00000000:1B:00.0, 94.02.42.00.B0
NVIDIA GeForce RTX 3090, 00000000:3D:00.0, 94.02.42.00.B0
NVIDIA GeForce RTX 3090, 00000000:3E:00.0, 94.02.42.00.B0

NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3


(porch) pae2115@dendrite:~/ValveNet_v2$ CUDA_IDX=0 python CADnet_ECG_001.py

View CADnet_ECG_ONLY_001_2021-11-12--07:54:27 for study log dir

[I 2021-11-12 07:54:27,246] A new study created in memory with name: CADnet_ECG_ONLY_001

data_config <OptunaDataConfig(batch_size=16, shuffle=True, train_sampler=None, validation_sampler=None, data_type=2, pin_memory=True, train_features_path=/home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_train_wander_removed_pct_truncated_mean_normalized_waveform_features.npy, train_labels_path=/home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_train_label.npy, train_tabular_path=None, validation_features_path=/home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_eval_wander_removed_pct_truncated_mean_normalized_waveform_features.npy, validation_labels_path=/home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_eval_label.npy, validation_tabular_path=None, features_permute_axes_order=[], labels_permute_axes_order=[]) @12fd0>

Loading features data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_train_wander_removed_pct_truncated_mean_normalized_waveform_features.npy

Loading labels data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_train_label.npy

Loading tabular data from None

result shape before permute_axes_order (7364, 1, 2500, 12)

result shape before permute_axes_order (7364,)

Done loading features data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_train_wander_removed_pct_truncated_mean_normalized_waveform_features.npy (7364, 1, 2500, 12)

Done loading labels data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_train_label.npy (7364,)

Loading features data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_eval_wander_removed_pct_truncated_mean_normalized_waveform_features.npy

Loading labels data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_eval_label.npy

Loading tabular data from None

result shape before permute_axes_order (2161, 1, 2500, 12)

result shape before permute_axes_order (2161,)

Done loading features data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_eval_wander_removed_pct_truncated_mean_normalized_waveform_features.npy (2161, 1, 2500, 12)
Done loading labels data from /home/pae2115/CADnet/amyloid_any_type_ECG_plus_echo_eval_label.npy (2161,)

Computing train label counts
Computing eval label counts

train label count: Counter({0.0: 5611, 1.0: 1753})
 eval label count: Counter({0.0: 1468, 1.0: 693})
POSITIVE_CLASS_WEIGHT: tensor([3.2008], dtype=torch.float64)  
High class imbalance in train, adding positive label weight(s): tensor([3.2008], dtype=torch.float64)
Optuna Train on: cuda:0
trial params dict_items([('batch_size', 16), ('lr', 5e-05), ('weight_decay', 0.01), ('optimizer_name', 'Adam'), ('filter_size', 16), ('dropout', 0.5)])
checkpoint_log_dir CADnet_ECG_ONLY_001_2021-11-12--07:54:27/trial_batch_size=16,lr=5e-05,weight_decay=0.01,optimizer_name=Adam,filter_size=16,dropout=0.5,
loss is on cuda:0 with pos weight tensor([3.2008], device='cuda:0', dtype=torch.float64)
2021-11-12 08:03:37,966 trainer INFO: Engine run starting with max_epochs=80.
2021-11-12 08:03:38,363 trainer ERROR: Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_MAPPING_ERROR.
2021-11-12 08:03:38,364 trainer ERROR: Engine run is terminating due to exception: cuDNN error: CUDNN_STATUS_MAPPING_ERROR.
[W 2021-11-12 08:03:38,364] Trial 0 failed because of the following error: RuntimeError('cuDNN error: CUDNN_STATUS_MAPPING_ERROR')
Traceback (most recent call last):
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/home/pae2115/ValveNet_v2/optuna_common/objective.py", line 110, in __call__
    trainer.run(train_data_loader, max_epochs=self.model_config.num_epochs)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 691, in run
    return self._internal_run()
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 762, in _internal_run
    self._handle_exception(e)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 730, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 811, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/__init__.py", line 99, in _update
    y_pred = model(x)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pae2115/ValveNet_v2/models/optuna_pytorch_models.py", line 84, in forward
    x = self.conv1(x)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Traceback (most recent call last):
  File "CADnet_ECG_001.py", line 111, in <module>
    redirect=args.redirect
  File "/home/pae2115/ValveNet_v2/optuna_common/trial_utils.py", line 50, in run_trials
    study.optimize(objective, n_trials=num_trials, gc_after_trial=True)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/optuna/study/study.py", line 409, in optimize
    show_progress_bar=show_progress_bar,
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/optuna/study/_optimize.py", line 76, in _optimize
    progress_bar=progress_bar,
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/optuna/study/_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/optuna/study/_optimize.py", line 264, in _run_trial
    raise func_err
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/home/pae2115/ValveNet_v2/optuna_common/objective.py", line 110, in __call__
    trainer.run(train_data_loader, max_epochs=self.model_config.num_epochs)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 691, in run
    return self._internal_run()
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 762, in _internal_run
    self._handle_exception(e)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 730, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
    raise e
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/engine.py", line 811, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/ignite/engine/__init__.py", line 99, in _update
    y_pred = model(x)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pae2115/ValveNet_v2/models/optuna_pytorch_models.py", line 84, in forward
    x = self.conv1(x)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/pae2115/anaconda3/envs/porch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

ptrblck · November 12, 2021, 8:30am

Your 3090 needs CUDA11, so you should install the current binaries with the CUDA11.3 runtime and cuDNN8.2.0.
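
As a rough illustration of the point above, the CUDA version a conda build was compiled against can be read off its build string. The sketch below is only an assumption-laden helper: the function names are made up for this example, the two build strings are the ones that appear in the conda output in this thread, and the "major version 11 or newer" threshold simply mirrors the reply above rather than any official compatibility table.

```python
import re


def cuda_version_from_build(build):
    """Return (major, minor) parsed from a conda build string, or None."""
    # Matches e.g. "cuda10.1" or "cuda11.3"; "cudnn..." does not match.
    match = re.search(r"cuda(\d+)\.(\d+)", build)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def build_targets_cuda11(build, min_major=11):
    """True if the build string names a CUDA major version >= min_major."""
    version = cuda_version_from_build(build)
    return version is not None and version[0] >= min_major


# Build strings as they appear in the conda output quoted in this thread.
old_build = "py3.7_cuda10.1.243_cudnn7.6.3_0"
new_build = "py3.7_cuda11.3_cudnn8.2.0_0"

for build in (old_build, new_build):
    print(build, cuda_version_from_build(build), build_targets_cuda11(build))
```

A build that fails this check was compiled against a CUDA 10.x runtime, which matches the environment that raised the error here.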

pelias · November 12, 2021, 9:41am

This is incredibly helpful but I’m not entirely sure how to do this, could you point me to some documentation on how to do so? Thank you for always being so helpful!

Best,
Pierre

ptrblck · November 12, 2021, 9:48am

Sure!
Check the install instructions here and select “CUDA 11.3”:

Make sure to uninstall the current PyTorch build in the used environment (run pip uninstall torch -y a few times and/or conda uninstall pytorch -y) or create a new virtual environment before installing the new binaries.

pelias · November 12, 2021, 7:10pm

Thank you for the additional detail. I did this in my conda environment, and both torch and ignite ended up uninstalled (after running the conda install command you list above). I then ran

pip install torch

then

conda install -c pytorch ignite

The latter does this

==> WARNING: A newer version of conda exists. <==
  current version: 4.10.1
  latest version: 4.10.3

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/pae2115/anaconda3/envs/porch

  added / updated specs:
    - ignite


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ignite-0.4.7               |             py_0         129 KB  pytorch
    pytorch-1.7.1              |py3.7_cuda10.1.243_cudnn7.6.3_0       552.8 MB  pytorch
    ------------------------------------------------------------
                                           Total:       552.9 MB

The following NEW packages will be INSTALLED:

  ignite             pytorch/noarch::ignite-0.4.7-py_0
  libuv              pkgs/main/linux-64::libuv-1.40.0-h7b6447c_0
  **pytorch            pytorch/linux-64::pytorch-1.7.1-py3.7_cuda10.1.243_cudnn7.6.3_0**

It seems as though installing ignite overwrites PyTorch with an older version. I can’t find any documentation online about PyTorch Ignite compatibility with CUDA 11.

ptrblck · November 12, 2021, 7:40pm

It seems that ignite is indeed pointing towards an older PyTorch release with CUDA10.
Note that pip install torch would also install the default CUDA 10.2 runtime, so stick to the CUDA 11.3 pip install command.
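
One way to tell which runtime a pip wheel carries is the local suffix in its version string, such as 1.10.0+cu113. The snippet below is a minimal sketch under that assumption: cuda_tag is a hypothetical helper name, and it only understands suffixes of the form +cuXYZ or +cpu. Version strings without a suffix (conda builds typically have none) return None.

```python
import re


def cuda_tag(version):
    """Map '1.10.0+cu113' to '11.3', '...+cpu' to 'cpu', anything else to None."""
    # Everything after '+' is the local version label of the wheel.
    _, _, local = version.partition("+")
    if local == "cpu":
        return "cpu"
    # Last digit is the minor version, the digits before it the major version.
    match = re.fullmatch(r"cu(\d+)(\d)", local)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


for version in ("1.10.0+cu113", "1.10.0+cu102", "1.10.0+cpu", "1.10.0"):
    print(version, "->", cuda_tag(version))
```

In a live environment the string to pass in would be torch.__version__.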

In any case, once you’ve properly installed the PyTorch binaries with CUDA 11.3, try to skip the dependencies in ignite via conda install -c pytorch --no-deps.

CC @vfdev-5 for visibility in case this is a known issue in ignite

pelias · November 12, 2021, 8:07pm

I attempted uninstalling PyTorch and ignite, followed by

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch-nightly

but after this there was no version of torch installed. Furthermore, when I run conda list cuda I see

cudatoolkit               10.1.243             h6bb024c_0    anaconda

I also tried

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

instead without luck.

I am now trying to build an environment from scratch.

And apologies for my limited understanding, but you are saying that once I get over the hurdle of pytorch installed with CUDA 11.3, then when I go to install ignite I should run

conda install -c pytorch ignite --no-deps

I was not sure as the conda install line you share does not have ignite in it.

pelias · November 12, 2021, 8:54pm

Got it working! Had to abandon the old environment and build from scratch. This ordering (followed by all the other packages) led to success. Thank you so much for your help!

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

conda list cuda

conda install -c pytorch ignite --no-deps
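
After an install sequence like this, a quick check from Python can confirm which runtime the environment actually picked up. This is a sketch rather than a definitive diagnostic: cuda_report is a made-up name, and the function returns a stub if torch cannot be imported at all. The attributes it reads (torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.is_available(), torch.cuda.get_device_name(), torch.cuda.get_device_capability()) are existing PyTorch calls.

```python
def cuda_report():
    """Collect the CUDA/cuDNN details PyTorch reports for this environment."""
    try:
        import torch
    except ImportError:
        # torch is not installed in the active environment.
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,          # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),     # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If cuda_runtime still shows a 10.x version after reinstalling, an older build is probably still being imported from the environment.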

vfdev-5 · November 12, 2021, 9:15pm

@pelias interesting issue you are facing.
I just tried to install on a new conda env based on python 3.7 and could not repro the issue:

(test) user:/# conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/test

  added / updated specs:
    - cudatoolkit=11.3
    - pytorch
    - torchvision


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    cudatoolkit-11.3.1         |       h2bc3f7f_2       549.3 MB
    freetype-2.11.0            |       h70c0345_0         618 KB
    giflib-5.2.1               |       h7b6447c_0          78 KB
    intel-openmp-2021.4.0      |    h06a4308_3561         4.2 MB
    jpeg-9d                    |       h7f8727e_0         232 KB
    libidn2-2.3.2              |       h7f8727e_0          81 KB
    libwebp-1.2.0              |       h89dd481_0         493 KB
    lz4-c-1.9.3                |       h295c915_1         185 KB
    mkl-2021.4.0               |     h06a4308_640       142.6 MB
    mkl-service-2.4.0          |   py37h7f8727e_0          56 KB
    mkl_fft-1.3.1              |   py37hd3c417c_0         172 KB
    mkl_random-1.2.2           |   py37h51133e4_0         287 KB
    numpy-1.21.2               |   py37h20f2e39_0          23 KB
    numpy-base-1.21.2          |   py37h79a1101_0         4.8 MB
    olefile-0.46               |           py37_0          50 KB
    pillow-8.4.0               |   py37h5aabda8_0         644 KB
    pytorch-1.10.0             |py3.7_cuda11.3_cudnn8.2.0_0        1.21 GB  pytorch
    pytorch-mutex-1.0          |             cuda           3 KB  pytorch
    torchvision-0.11.1         |       py37_cu113        30.3 MB  pytorch
    typing_extensions-3.10.0.2 |     pyh06a4308_0          31 KB
    ------------------------------------------------------------
                                           Total:        1.92 GB

and

(test) user:/# conda install ignite -c pytorch
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/test

  added / updated specs:
    - ignite

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ignite-0.4.7               |             py_0         129 KB  pytorch
    ------------------------------------------------------------
                                           Total:         129 KB

The following NEW packages will be INSTALLED:

  ignite             pytorch/noarch::ignite-0.4.7-py_0

Finally, I have

(test) user:/# conda list | grep pytorch
ffmpeg                    4.3                  hf484d3e_0    pytorch
ignite                    0.4.7                      py_0    pytorch
pytorch                   1.10.0          py3.7_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchvision               0.11.1               py37_cu113    pytorch

I wonder whether it is related to the use of the conda-forge channel …

PS @ptrblck thanks for pinging!

ptrblck · November 12, 2021, 10:21pm

pelias:

And apologies for my limited understanding, but you are saying that once I get over the hurdle of pytorch installed with CUDA 11.3, then when I go to install ignite I should run
conda install -c pytorch ignite --no-deps
I was not sure as the conda install line you share does not have ignite in it.

Yes, sorry for dropping the ignite in the install command.

Good to hear it’s working now.

conda-forge shouldn’t be needed, but you might be right that a channel mix could cause some version mismatches.
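
To see whether an environment mixes channels, the output of conda list can be grouped by its channel column. The sketch below assumes the usual four-column layout (name, version, build, channel) and treats rows without a channel column as coming from "defaults". packages_by_channel is a hypothetical helper, and the sample rows are copied from the conda list excerpt in the first post.

```python
from collections import defaultdict

# Sample rows copied from the `conda list` excerpt in the first post.
CONDA_LIST = """\
cudatoolkit 10.1.243 h6bb024c_0 anaconda
cudnn 7.6.5.32 hc0a50b0_1 conda-forge
python 3.7.5 h0371630_0 anaconda
pytorch 1.3.1 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
torchvision 0.4.2 py37_cu101 pytorch
"""


def packages_by_channel(listing):
    """Group package names from `conda list` text by their channel column."""
    channels = defaultdict(list)
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments conda prints
        fields = line.split()
        if len(fields) < 3:
            continue  # not a package row
        # Assumption: a missing fourth column means the default channel.
        channel = fields[3] if len(fields) > 3 else "defaults"
        channels[channel].append(fields[0])
    return dict(channels)


for channel, names in sorted(packages_by_channel(CONDA_LIST).items()):
    print(f"{channel}: {', '.join(names)}")
```

For the original environment this shows cudnn coming from conda-forge while pytorch and torchvision come from the pytorch channel, which is the kind of mix suspected above.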