Hi,
I am getting this error when I try to run my model training on a cloud machine:
File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/nn/functional.py", line 2438, in batch_norm
return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
The same code runs fine on my local machine, though. The local machine has less memory than the remote one, so I don't think this is an out-of-memory issue.
I also don't think it is a CUDA version mismatch, since the versions seem to be aligned. On the remote machine, python -m torch.utils.collect_env
shows:
PyTorch version: 1.12.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5
Libc version: glibc-2.31
Python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1035-azure-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: Tesla K80
Nvidia driver version: 470.182.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.21.6
[pip3] pytorch-transformers==1.0.0
[pip3] torch==1.12.0+cu113
[pip3] torchaudio==0.12.0
[pip3] torchmetrics==0.10.2
[pip3] torchvision==0.13.0
[conda] _pytorch_select 0.1 cpu_0 anaconda
[conda] blas 1.0 mkl anaconda
[conda] cudatoolkit 11.3.1 h2bc3f7f_2
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] numpy 1.21.6 py38h1d589f8_0 conda-forge
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-transformers 1.0.0 pypi_0 pypi
[conda] torch 1.12.0+cu113 pypi_0 pypi
[conda] torchaudio 0.12.0 py38_cu113 pytorch
[conda] torchmetrics 0.10.2 pypi_0 pypi
[conda] torchvision 0.13.0 py38_cu113 pytorch
What other possible causes of this error could I check?
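As a starting point, here is a minimal sketch of checks I can run on the remote machine. It is not my real training code: the helper names arch_supported and run_checks are made up for illustration, and the small BatchNorm2d layer and tensor shape are stand-ins for the failing layer. It compares the GPU's compute capability with the architectures this PyTorch build was compiled for, prints the cuDNN version PyTorch actually loads, and runs one batch norm forward pass with cuDNN disabled and then enabled to see whether the failure is specific to the cuDNN path.

```python
import os

# Make CUDA kernel launches synchronous so an error is reported at the call
# that caused it. This must be set before CUDA is initialised.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def arch_supported(capability, arch_list):
    """Return True if a (major, minor) compute capability appears in arch_list.

    arch_list entries are expected to look like "sm_37" or "compute_80".
    """
    tag = f"{capability[0]}{capability[1]}"
    return any(arch.split("_")[-1] == tag for arch in arch_list)


def run_checks():
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("CUDA is not available")
        return

    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print("device:", torch.cuda.get_device_name(0), "capability:", cap)
    print("architectures compiled into this build:", archs)
    print("device capability is in that list:", arch_supported(cap, archs))
    print("cuDNN version loaded by PyTorch:", torch.backends.cudnn.version())

    # Stand-in for the failing layer (assumption: not the real model).
    bn = torch.nn.BatchNorm2d(8).cuda()
    x = torch.randn(4, 8, 16, 16, device="cuda")

    # Run the non-cuDNN path first: a CUDA error can leave the context in a
    # bad state, which would make any later result unreliable.
    for enabled in (False, True):
        with torch.backends.cudnn.flags(enabled=enabled):
            try:
                bn(x)
                torch.cuda.synchronize()
                print(f"batch_norm with cudnn enabled={enabled}: ok")
            except RuntimeError as err:
                print(f"batch_norm with cudnn enabled={enabled}: failed: {err}")


if __name__ == "__main__":
    run_checks()
```

If the forward pass only fails with cuDNN enabled, that would point at the cuDNN library or its support for this GPU rather than at the model code. If it fails either way, the problem is more likely elsewhere (driver, device state, or the data reaching the layer).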