RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED - not OOM or version issue

Hi,
I am getting this error when trying to run my model training on a cloud machine:

File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/functional.py", line 2438, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The same code runs fine on my local machine, though. The local machine has less GPU memory than the remote one, so I don't think this is an out-of-memory issue.
It also doesn't look like a CUDA version mismatch; the versions seem to be aligned. python -m torch.utils.collect_env shows:
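To rule out my training code itself, here is a minimal standalone check I can run on the cloud machine (a sketch, not the actual model): a single BatchNorm2d forward pass on the GPU. If this alone raises CUDNN_STATUS_EXECUTION_FAILED, the problem is environmental rather than something in my training script.

```python
import torch
import torch.nn as nn

# Run one isolated BatchNorm2d forward pass, on GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
bn = nn.BatchNorm2d(16).to(device)
x = torch.randn(8, 16, 32, 32, device=device)

out = bn(x)
print(out.shape)  # torch.Size([8, 16, 32, 32]) — batch norm preserves the shape
```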

PyTorch version: 1.12.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5
Libc version: glibc-2.31

Python version: 3.8.5 (default, Sep  4 2020, 07:30:14)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1035-azure-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 470.182.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.21.6
[pip3] pytorch-transformers==1.0.0
[pip3] torch==1.12.0+cu113
[pip3] torchaudio==0.12.0
[pip3] torchmetrics==0.10.2
[pip3] torchvision==0.13.0
[conda] _pytorch_select           0.1                       cpu_0    anaconda
[conda] blas                      1.0                         mkl    anaconda
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] numpy                     1.21.6           py38h1d589f8_0    conda-forge
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-transformers      1.0.0                    pypi_0    pypi
[conda] torch                     1.12.0+cu113             pypi_0    pypi
[conda] torchaudio                0.12.0               py38_cu113    pytorch
[conda] torchmetrics              0.10.2                   pypi_0    pypi
[conda] torchvision               0.13.0               py38_cu113    pytorch

What could be other possible reasons for this error that I could check?
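One check I'm considering (just a sketch, I haven't tried it yet): globally disabling cuDNN so PyTorch falls back to its native CUDA kernels. If training then succeeds, the failure is specific to the cuDNN path.

```python
import torch
import torch.nn as nn

# Force PyTorch to skip cuDNN entirely; native CUDA/CPU kernels are used instead.
torch.backends.cudnn.enabled = False

device = "cuda" if torch.cuda.is_available() else "cpu"
bn = nn.BatchNorm2d(8).to(device)
x = torch.randn(4, 8, 16, 16, device=device)
print(bn(x).shape)  # should succeed without touching cuDNN
```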

Could you update PyTorch to the latest stable or nightly release and check if you would still see the issue?

Ok, but that means I need to change the CUDA Toolkit version as well to 11.7 or 11.8, right?

No, the PyTorch binaries ship with their own CUDA dependencies; your locally installed CUDA toolkit is only used if you build PyTorch from source or compile a custom CUDA extension.
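As a quick sanity check, you can confirm which CUDA and cuDNN versions the installed binary actually bundles, independent of the system toolkit:

```python
import torch

# These versions come from the PyTorch wheel itself,
# not from the CUDA toolkit installed on the system.
print(torch.__version__)               # e.g. 1.12.0+cu113
print(torch.version.cuda)              # CUDA version the wheel was built with
print(torch.backends.cudnn.version())  # bundled cuDNN version as an int, e.g. 8302
```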


ah I see, thanks a lot!