BatchNorm: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi, I have a problem running my code on a GPU. I am working on a remote server, and we have several GPUs, some of which are bigger than others. When I run my code on the other GPUs it runs correctly and I do not have any problem. But when I try the big one, whose host reports Memory: 60.63 GiB / 503.78 GiB (12.03%),

I get the following error. I created a new environment and reinstalled PyTorch, but it did not help:
return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I deactivated batch norm and changed the batch size, but neither helped.

I need to run my script on this GPU. Could you please help me solve this?

Which GPU are you using? Could you disable cudnn via torch.backends.cudnn.enabled = False and rerun your script? If that works, could you post the batchnorm setup as well as the input shapes?
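I.e. something along these lines (a minimal sketch; the layer and input shapes below are placeholders, not your actual model):

import torch
import torch.nn as nn

# Globally disable cuDNN so PyTorch falls back to its native kernels.
torch.backends.cudnn.enabled = False

# Standalone batchnorm call with placeholder shapes.
bn = nn.BatchNorm2d(128).cuda()
x = torch.randn(64, 128, 32, 32, device='cuda')
out = bn(x)
torch.cuda.synchronize()  # make sure the kernel actually ran
print(out.shape)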

Also, could you run python -m torch.utils.collect_env and post the output here?


Can you tell me which GPU you are using? Also, please share which GPUs are available. It may be related to a version or library issue.

Hi, I am using an A100-SXM4-40GB GPU, and I tried setting torch.backends.cudnn.enabled = False, but it did not help.

And this is the information I got from python -m torch.utils.collect_env:

PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB

Nvidia driver version: 460.32.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.1
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.3.0 py38h54f3939_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.8.1 py3.8_cuda10.2_cudnn7.6.5_0 pytorch
[conda] torchvision 0.9.1 py38_cu102 pytorch

This is what I can see by running nvidia-smi:

Thu Apr 29 12:29:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
| N/A   56C    P0   169W / 400W |  12738MiB / 40536MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
| N/A   61C    P0   338W / 400W |  13131MiB / 40536MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
| N/A   48C    P0   225W / 400W |  12608MiB / 40536MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
| N/A   52C    P0   324W / 400W |  12590MiB / 40536MiB |     98%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    102552      C   python                          12735MiB |
|    1   N/A  N/A    149282      C   python                          12587MiB |
|    1   N/A  N/A    185753      C   python                            541MiB |
|    2   N/A  N/A     31986      C   python                          12605MiB |
|    3   N/A  N/A    184443      C   python                          12587MiB |
+-----------------------------------------------------------------------------+

The CUDA version reported by the driver is 11.2, while the CUDA used to build PyTorch is 10.2. Could that be the problem?

Should I upgrade the CUDA version to 11.2?
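For reference, this mismatch can be inspected from Python with the standard torch.cuda helpers (a quick sketch):

import torch

# CUDA toolkit the installed PyTorch binary was built against (here: 10.2).
print("torch.version.cuda:", torch.version.cuda)

# Compute capability of the device; an A100 reports (8, 0).
print("device capability:", torch.cuda.get_device_capability(0))

# Architectures the binary ships kernels for. The CUDA 10.2 builds stop at
# sm_75, so sm_80 (A100) is missing and its kernels can fail.
print("compiled arch list:", torch.cuda.get_arch_list())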

I am using an A100-SXM4-40GB, and all of the GPUs are available. In our group we have 4 GPUs of this type, and I am running my script on the remote server.

For comparison, I have a V100 32GB with NVIDIA driver 460, and I installed PyTorch in a conda environment using:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
Nothing had to be installed outside the conda environment. Try this in a new environment; I hope it works.
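After installing in the fresh environment, a quick check like this (a sketch; the exact version strings will differ) confirms the intended build was picked up:

import torch

print(torch.__version__)               # installed PyTorch version
print(torch.version.cuda)              # should report '11.1' for this build
print(torch.cuda.is_available())       # True if the driver can see the GPUs
print(torch.backends.cudnn.version())  # cuDNN bundled with the binary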

I installed pytorch torchvision torchaudio cudatoolkit=11.2. Does this also work?

I mean cudatoolkit=11.2 instead of cudatoolkit=11.1?

I am not sure about it, but for me 11.1 works.
Check here.

11.2 also worked. Thank you. :)

Hi, yesterday when I created a new conda env and installed PyTorch with cudatoolkit=11.2, it worked, but half an hour later, when I tried it again, it did not work. I got this error message when I was trying to install my own package:

WARNING: Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /home/envs/vir-env4/include/python3.9/UNKNOWN
sysconfig: /home/anaconda3/envs/vir-env4/include/python3.9
WARNING: Additional context:
user = False
home = None
root = None
prefix = None
Obtaining file:///mnt/home/Baysian
Installing collected packages: Baysian-Seg
  Running setup.py develop for Baysian-Seg
WARNING: Value for scheme.headers does not match. Please report this to <https://github.com/pypa/pip/issues/9617>
distutils: /home/anaconda3/envs/vir-env4/include/python3.9/UNKNOWN
sysconfig: /home/anaconda3/envs/vir-env4/include/python3.9
WARNING: Additional context:
user = False
home = None
root = None
prefix = None

Could the error here be problematic?

I am not sure about that error, but I guess it is not related to PyTorch.
Try running this simple example:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

If you still get an error in the above example, then there might be an error in the installation.

Otherwise, the error is related to something else.
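If the tutorial is too heavy, a quicker sanity check in the same spirit (a rough sketch, not your actual model) is a single conv + batchnorm forward/backward pass on the GPU:

import torch
import torch.nn as nn

# Tiny end-to-end check: conv + batchnorm, forward and backward, on the GPU.
# If this runs cleanly, the install and the bundled cuDNN are basically working.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)).cuda()
x = torch.randn(4, 3, 32, 32, device='cuda')
model(x).sum().backward()
torch.cuda.synchronize()
print("ok")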

OK, thank you for the clue. I will do that.

Hey, I did that. It works for both the simple classifier and my code.

But it is super slow, very slow even for the classifier. Do you know what the reason could be?