RuntimeError: cuDNN error: CUDNN_STATUS_ARCH_MISMATCH

I am getting the following error:

Traceback (most recent call last):
  File "/home/kitoo/mlpractical/pytorch_mlp_framework/train_evaluate_image_classification_system.py", line 74, in <module>
    experiment_metrics, test_metrics = conv_experiment.run_experiment()  # run experiment and return experiment metrics
  File "/home/kitoo/mlpractical/pytorch_mlp_framework/experiment_builder.py", line 258, in run_experiment
    loss, accuracy = self.run_train_iter(x=x, y=y)  # take a training iter step
  File "/home/kitoo/mlpractical/pytorch_mlp_framework/experiment_builder.py", line 182, in run_train_iter
    out = self.model.forward(x)  # forward the data in the model
  File "/home/kitoo/mlpractical/pytorch_mlp_framework/model_architectures.py", line 319, in forward
    out = self.layer_dict['input_conv'].forward(out)
  File "/home/kitoo/mlpractical/pytorch_mlp_framework/model_architectures.py", line 138, in forward
    out = self.layer_dict['conv_0'].forward(out)
  File "/opt/conda/envs/mlp/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/mlp/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_ARCH_MISMATCH

The strange thing is that the same code was working until a few days ago. I was running it on a GCP compute instance and had to delete and recreate the instance from the same image; the error has been occurring ever since I created the new instance.

I am using a Tesla K80 GPU with this NVIDIA driver:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    73W / 149W |      0MiB / 11441MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

PyTorch 2.1.1 with CUDA 11.8.
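
For reference, a quick way to check what the installed build actually supports (this assumes a single-GPU setup, so device index 0; all of these are standard PyTorch calls):

import torch

print(torch.__version__)                    # installed PyTorch version
print(torch.version.cuda)                   # CUDA version the binary was built against
print(torch.backends.cudnn.version())       # cuDNN version bundled with the binary
print(torch.cuda.get_device_capability(0))  # the K80 should report (3, 7)
print(torch.cuda.get_arch_list())           # compute capabilities the binary was compiled for

If sm_37 does not appear in that last list, the binary cannot target the K80 at all, which would be consistent with the arch-mismatch error.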

What changed in the last few days in your setup?

The setup seems to be the same as far as I can tell, since I used the same base image to create both instances, though I did have to reinstall PyTorch manually on each instance.

Is there anything specific I should be checking for that could cause this issue?

You could use LD_DEBUG=libs to check whether a newly installed cuDNN package is interfering with the one bundled in your PyTorch build. The loader output will show which libcudnn*.so* is actually loaded, which might indicate conflicts.
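
As a minimal sketch of that check (assuming Linux, since it reads /proc/self/maps, and a visible CUDA device):

import torch

# A tiny convolution forces PyTorch to load cuDNN; the arch-mismatch error,
# if it occurs, is raised only after the library is already mapped, so the
# inspection below still works.
try:
    conv = torch.nn.Conv2d(1, 1, 3).cuda()
    conv(torch.randn(1, 1, 8, 8, device="cuda"))
except RuntimeError as e:
    print("conv failed:", e)

# List every libcudnn shared object mapped into this process.
with open("/proc/self/maps") as f:
    paths = {line.split()[-1] for line in f if "libcudnn" in line}
for p in sorted(paths):
    print(p)

If the printed path points somewhere other than your PyTorch installation (e.g. a system-wide /usr/lib location instead of the site-packages directory of the mlp environment from your traceback), a locally installed cuDNN is shadowing the bundled one. Prefixing the run with LD_DEBUG=libs additionally prints the dynamic loader's search order for each library.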