BatchNorm: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi, I have a problem running my code on a GPU. I am working on a remote server, and we have several GPUs, some of which are bigger than others. When I run my code on the other GPUs it runs correctly and I do not have any problem. But when I try the big one, whose host reports Memory: 60.63 GiB / 503.78 GiB (12.03%),

I get the following error. I created a new environment and reinstalled PyTorch, but it did not help:
return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I deactivated batch norm and changed the batch size, but neither helped.

I need to run my script on this GPU. Could you please help me solve this?

Which GPU are you using? Could you disable cudnn via torch.backends.cudnn.enabled = False and rerun your script? If that works, could you post the batchnorm setup as well as the input shapes?
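I.e. something along these lines (a minimal sketch; the layer and input shapes below are placeholders, not your actual model):

import torch
import torch.nn as nn

# Globally disable cuDNN so PyTorch falls back to its native kernels.
torch.backends.cudnn.enabled = False

# Standalone batchnorm call with placeholder shapes.
bn = nn.BatchNorm2d(128).cuda()
x = torch.randn(64, 128, 32, 32, device='cuda')
out = bn(x)
torch.cuda.synchronize()  # make sure the kernel actually ran
print(out.shape)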

Also, could you run python -m torch.utils.collect_env and post the output here?


Can you tell me which GPU you are using? Also, please share which GPUs are available. It may be related to a version or library issue.

Hi, I am using an A100-SXM4-40GB GPU, and I tried setting torch.backends.cudnn.enabled = False, but it did not help.

And this is the information I got from python -m torch.utils.collect_env:

PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB

Nvidia driver version: 460.32.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.1
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.3.0 py38h54f3939_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.8.1 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchvision 0.9.1 py38_cu102 pytorch

This is what I can see by running nvidia-smi:

Thu Apr 29 12:29:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
| N/A   56C    P0   169W / 400W |  12738MiB / 40536MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
| N/A   61C    P0   338W / 400W |  13131MiB / 40536MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
| N/A   48C    P0   225W / 400W |  12608MiB / 40536MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
| N/A   52C    P0   324W / 400W |  12590MiB / 40536MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    102552      C   python                          12735MiB |
|    1   N/A  N/A    149282      C   python                          12587MiB |
|    1   N/A  N/A    185753      C   python                            541MiB |
|    2   N/A  N/A     31986      C   python                          12605MiB |
|    3   N/A  N/A    184443      C   python                          12587MiB |
+-----------------------------------------------------------------------------+

The CUDA version reported by the driver is 11.2, while the CUDA used to build PyTorch is 10.2. Could that be the problem?

Should I upgrade the CUDA version to 11.2?
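For reference, this mismatch can be inspected from Python with the standard torch.cuda helpers (a quick sketch):

import torch

# CUDA toolkit the installed PyTorch binary was built against (here: 10.2).
print("torch.version.cuda:", torch.version.cuda)

# Compute capability of the device; an A100 reports (8, 0).
print("device capability:", torch.cuda.get_device_capability(0))

# Architectures the binary ships kernels for. The CUDA 10.2 builds stop at
# sm_75, so sm_80 (A100) is missing and its kernels can fail.
print("compiled arch list:", torch.cuda.get_arch_list())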

I am using an A100-SXM4-40GB, and all of the GPUs are available. In our group we have 4 GPUs of this type, and I am running my script on the remote server.

For comparison, I have a V100 32GB with NVIDIA driver 460, and I installed PyTorch in a conda environment using:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
Nothing had to be installed outside the conda environment. Try this in a new environment; I hope it works.
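After installing in the fresh environment, a quick check like this (a sketch; the exact version strings will differ) confirms the intended build was picked up:

import torch

print(torch.__version__)               # installed PyTorch version
print(torch.version.cuda)              # should report '11.1' for this build
print(torch.cuda.is_available())       # True if the driver can see the GPUs
print(torch.backends.cudnn.version())  # cuDNN bundled with the binary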

I installed pytorch torchvision torchaudio cudatoolkit=11.2. Does this also work?

I mean cudatoolkit=11.2 instead of cudatoolkit=11.1?

I am not sure about it, but for me 11.1 works.
Check here.

11.2 also worked. Thank you. :)

Hi, yesterday when I created a new conda env and installed PyTorch with cudatoolkit=11.2, it worked, but half an hour later, when I tried it again, it did not work. I got this error message when I was trying to install my own package:

WARNING: Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /home/envs/vir-env4/include/python3.9/UNKNOWN
sysconfig: /home/anaconda3/envs/vir-env4/include/python3.9
WARNING: Additional context:
user = False
home = None
root = None
prefix = None
Obtaining file:///mnt/home/Baysian
Installing collected packages: Baysian-Seg
  Running setup.py develop for Baysian-Seg
WARNING: Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /home/anaconda3/envs/vir-env4/include/python3.9/UNKNOWN
sysconfig: /home/anaconda3/envs/vir-env4/include/python3.9
WARNING: Additional context:
user = False
home = None
root = None
prefix = None

Could the error here be problematic?

I am not sure about that error, but I guess it is not related to PyTorch.
Try running this simple example:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

If you still get an error in the above example, then there might be an error in the installation.

Otherwise, the error is related to something else.
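If the tutorial is too heavy, a quicker sanity check in the same spirit (a rough sketch, not your actual model) is a single conv + batchnorm forward/backward pass on the GPU:

import torch
import torch.nn as nn

# Tiny end-to-end check: conv + batchnorm, forward and backward, on the GPU.
# If this runs cleanly, the install and the bundled cuDNN are basically working.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)).cuda()
x = torch.randn(4, 3, 32, 32, device='cuda')
model(x).sum().backward()
torch.cuda.synchronize()
print("ok")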

OK, thank you for the clue. I will do that.

Hey, I did that. It works for both the simple classifier and my code.

But it is super slow, very slow even for the classifier. Do you know what the reason could be?