RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED when calling backward()

Hi there,

I went to an NVIDIA workshop and wanted to run the exact same example on my computer at home. For some reason, I am running into a RuntimeError.

(code from NVIDIA workshop)

    from torch.nn import Module

    EPS = 0.00001  # small constant to avoid division by zero

    class DiceLoss(Module):
        def forward(self, input, target):
            # sum over the spatial dimensions (D, H, W) of [N, C, D, H, W] tensors
            num = (input * target).sum(dim=4, keepdim=True).sum(dim=3, keepdim=True).sum(dim=2, keepdim=True)
            den1 = input.pow(2).sum(dim=4, keepdim=True).sum(dim=3, keepdim=True).sum(dim=2, keepdim=True)
            den2 = target.pow(2).sum(dim=4, keepdim=True).sum(dim=3, keepdim=True).sum(dim=2, keepdim=True)

            dice = 2.0 * num / (den1 + den2 + EPS)
            return (1.0 - dice).mean()
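
For reference, the loss expects 5-dimensional [N, C, D, H, W] tensors. A minimal standalone call (shapes made up purely for illustration, not the workshop data) looks like this:

    import torch

    # toy prediction and binary mask, just to show the expected tensor layout
    pred = torch.rand(2, 1, 16, 16, 8, requires_grad=True)
    mask = (torch.rand(2, 1, 16, 16, 8) > 0.5).float()

    loss = DiceLoss()(pred, mask)
    loss.backward()
    print(loss.item())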

The error only appears when calling backward().

Device info:
2 x Titan RTX; PyTorch binary compiled with CUDA 10.0

OS: RHEL

Installation of PyTorch was done with the following:

pip3 install https://download.pytorch.org/whl/cu100/torch-1.0.1.post2-cp36-cp36m-linux_x86_64.whl
pip3 install torchvision

I am thankful for any help.

To the best of my knowledge, tensors were moved to the GPU via .cuda()

Is the code running fine on your CPU?
If so, could you disable cuDNN and try it again on your GPUs (torch.backends.cudnn.enabled = False)?
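
I.e. something like this at the top of your script (just a sketch):

    import torch

    # fall back to the native (non-cuDNN) CUDA kernels
    torch.backends.cudnn.enabled = False

    # ... then run the usual forward/backward pass on the GPU and see if the error persists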

Thanks @ptrblck for the quick reply.

If I disable cuDNN, the error doesn’t show up, but then it runs out of memory.

RuntimeError: CUDA out of memory. Tried to allocate 3.57 GiB (GPU 0; 23.62 GiB total capacity; 22.08 GiB already allocated; 691.56 MiB free; 5.43 MiB cached)

I am training a V-Net with an input volume size of [304, 304, 96].

Do you have any suggestions for what I should do about the cuDNN execution error?

Could you try to lower the size of your input to avoid the OOM issue and see if the code runs fine without cuDNN enabled?
Also, is your code running on CPU? The error messages might be a bit clearer if thrown from a CPU run.

Hi @ptrblck,
It’s working just fine on the CPU. Strangely, it works with smaller volumes, but I cannot see any memory exhaustion when watching GPU memory. I am using a V-Net implementation with an input volume size of (304, 304, 96).

Traceback (most recent call last):
  File "train.py", line 82, in <module>
    main(config, args.resume)
  File "train.py", line 51, in main
    trainer.train()
  File "/home/sources/project/source/base/base_trainer.py", line 92, in train
    result = self._train_epoch(epoch)
  File "/home/sources/project/source/trainer/trainer.py", line 71, in _train_epoch
    loss.backward()
  File "/usr/local/lib64/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I also moved to half precision (NVIDIA Apex), and while testing on the “large” volumes I ran into this issue again because I had to enable cuDNN.
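
For context, my Apex setup roughly follows the standard O1 recipe; here is a simplified, self-contained sketch (placeholder model and data, not my actual trainer):

    import torch
    from torch import nn, optim
    from apex import amp

    # tiny placeholder model just to illustrate the amp calls
    model = nn.Conv3d(1, 1, kernel_size=3, padding=1).cuda()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # O1: patch torch functions, use dynamic loss scaling (matches the log output below)
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    data = torch.rand(1, 1, 16, 16, 16, device="cuda")
    target = torch.rand(1, 1, 16, 16, 16, device="cuda")

    loss = nn.functional.mse_loss(model(data), target)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()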

Is the code publicly available?
I don’t have a Titan RTX, but would like to try it on my 1080Ti.

It’s not public.

One thing that seems really strange: when I change the batch size for a dataset where CUDNN_STATUS_EXECUTION_FAILED happens (e.g. from batch size 2 to batch size 6), it works.

Do you have any idea how that could cause it to crash?

Could you try to collect the cuDNN log as @ngimel described here?

Also, could you run the script again with CUDA_LAUNCH_BLOCKING=1 python script.py args?
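
For reference, the cuDNN API log mentioned above can usually be enabled via environment variables before starting the script, e.g. (the log file name is just a placeholder):

    # enable cuDNN API logging (cuDNN 7.x) and synchronous CUDA kernel launches
    export CUDNN_LOGINFO_DBG=1
    export CUDNN_LOGDEST_DBG=cudnn.log   # or stdout / stderr
    CUDA_LAUNCH_BLOCKING=1 python script.py args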

This is just a wild guess, but do you see your GPU(s) running out of memory?
While this might be counter-intuitive, since a larger batch size runs fine, cuDNN might pick different algorithms (using less memory) for different batch sizes. I’m really not sure at this point what might cause this issue, so I’m just guessing at what might be happening.
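
To check that, you could print the memory stats right after the backward call, e.g. (sketch):

    import torch

    # current, peak, and cached allocations on the default CUDA device
    print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
    print(f"cached:    {torch.cuda.memory_cached() / 1024**3:.2f} GiB")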

@ptrblck I finally got to test a bit more. Using different batch sizes worked for a while, but now that I have changed the input data it fails with pretty much every batch size I have tried.

CUDA_LAUNCH_BLOCKING=1 didn’t help: the script allocated some memory on the GPU, but then somehow got stuck and didn’t react to any keyboard interrupts.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Gradient overflow.  Skipping step, reducing loss scale to 32768.0
Train Epoch: 1 [0/111 (0%)] Loss: 197.427246
Gradient overflow.  Skipping step, reducing loss scale to 16384.0
Train Epoch: 1 [36/111 (30%)] Loss: 208.496323
Train Epoch: 1 [72/111 (60%)] Loss: 261.294617
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I! CuDNN (v7402) function cudnnGetConvolutionBackwardFilterAlgorithmMaxCount() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i! Time: 2019-05-16T16:12:59.272653 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=0; Handle=0x7fa31800ad00; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnGetConvolutionBackwardFilterAlgorithmMaxCount() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i! Time: 2019-05-16T16:12:59.272662 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=0; Handle=0x7fa31800ad00; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnGetConvolutionBackwardFilterWorkspaceSize() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     xDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[2,16,128,128,64];
i!         strideA: type=int; val=[16777216,1048576,8192,64,1];
i!     dyDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[2,16,128,128,64];
i!         strideA: type=int; val=[16777216,1048576,8192,64,1];
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
i!         arrayLength: type=int; val=3;
i!         padA: type=int; val=[1,1,1];
i!         strideA: type=int; val=[1,1,1];
i!         dilationA: type=int; val=[1,1,1];
i!         groupCount: type=int; val=1;
i!     dwDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[16,16,3,3,3];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     algo: type=cudnnConvolutionBwdFilterAlgo_t; val=CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (1);
i! Time: 2019-05-16T16:12:59.272682 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=0; Handle=0x7fa31800ad00; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnConvolutionBackwardFilter() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     alpha: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     xDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[2,16,128,128,64];
i!         strideA: type=int; val=[16777216,1048576,8192,64,1];
i!     xData: location=dev; addr=0x7f9f38000000;
i!     dyDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[2,16,128,128,64];
i!         strideA: type=int; val=[16777216,1048576,8192,64,1];
i!     dyData: location=dev; addr=0x7f9f40000000;
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
i!         arrayLength: type=int; val=3;
i!         padA: type=int; val=[1,1,1];
i!         strideA: type=int; val=[1,1,1];
i!         dilationA: type=int; val=[1,1,1];
i!         groupCount: type=int; val=1;
i!     algo: type=cudnnConvolutionBwdFilterAlgo_t; val=CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 (1);
i!     workSpace: location=dev; addr=0x7fa026920000;
i!     workSpaceSizeInBytes: type=size_t; val=2754704;
i!     beta: type=CUDNN_DATA_FLOAT; val=0.000000;
i!     dwDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[16,16,3,3,3];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     dwData: location=dev; addr=0x7fa059ff3000;
i! Time: 2019-05-16T16:12:59.272707 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=0; Handle=0x7fa31800ad00; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnDestroyConvolutionDescriptor() called:
i! Time: 2019-05-16T16:12:59.272739 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnDestroyFilterDescriptor() called:
i! Time: 2019-05-16T16:12:59.272749 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnDestroyTensorDescriptor() called:
i! Time: 2019-05-16T16:12:59.272757 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnDestroyTensorDescriptor() called:
i! Time: 2019-05-16T16:12:59.272765 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnSetStream() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     streamId: type=cudaStream_t; streamId=(nil) (defaultStream);
i! Time: 2019-05-16T16:12:59.272774 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=0; Handle=0x7fa31800ad00; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnCreateTensorDescriptor() called:
i! Time: 2019-05-16T16:12:59.272789 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnSetTensorNdDescriptor() called:
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!     nbDims: type=int; val=5;
i!     dimA: type=int; val=[1,16,1,1,1];
i!     strideA: type=int; val=[16,1,1,1,1];
i! Time: 2019-05-16T16:12:59.272799 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnCreateTensorDescriptor() called:
i! Time: 2019-05-16T16:12:59.272808 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnSetTensorNdDescriptor() called:
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!     nbDims: type=int; val=5;
i!     dimA: type=int; val=[2,16,128,128,64];
i!     strideA: type=int; val=[16777216,1048576,8192,64,1];
i! Time: 2019-05-16T16:12:59.272818 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnConvolutionBackwardBias() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     alpha: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     srcDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[2,16,128,128,64];
i!         strideA: type=int; val=[16777216,1048576,8192,64,1];
i!     srcData: location=dev; addr=0x7f9f40000000;
i!     beta: type=CUDNN_DATA_FLOAT; val=0.000000;
i!     destDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[1,16,1,1,1];
i!         strideA: type=int; val=[16,1,1,1,1];
i!     destData: location=dev; addr=0x7fa5e0efe200;
i! Time: 2019-05-16T16:12:59.272833 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=0; Handle=0x7fa31800ad00; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnDestroyTensorDescriptor() called:
i! Time: 2019-05-16T16:12:59.272849 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnDestroyTensorDescriptor() called:
i! Time: 2019-05-16T16:12:59.272857 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12826; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnDestroy() called:
i! Time: 2019-05-16T16:12:59.477687 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12596; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnDestroy() called:
i! Time: 2019-05-16T16:12:59.477978 (0d+0h+1m+10s since start)
i! Process=12596; Thread=12596; GPU=NULL; Handle=NULL; StreamId=NULL.

Let me know, if you have any ideas what I can try next.

Thanks a lot,
Christian


Thanks for the information, Christian!

What did you change regarding the input data?
Could you post the link to the workshop you were doing?
Also, could you post your current setup, i.e. PyTorch version, CUDA version, cuDNN version, GPUs used, etc., as I would like to reproduce this error.

The input data changed from

Input image size: (304,304,96)
Input label (target): (304,304,96) binary image
pixel spacing = 0.25

to

Input size: (256,256,128)
Input label (target): (256,256,128) distance map
pixel spacing = 0.25

PyTorch version: 1.0.1.post2
CUDA version: 10.0.130
cuDNN version: 7.4.2
GPUs: 2x NVIDIA Titan RTX

The original workshop is from NVIDIA and its name is 3-D Segmentation for Medical Imaging with V-Net

You can find it here: https://courses.nvidia.com/courses

I only used it as a template, though. Basically, I just took their V-Net model and incorporated it into https://github.com/victoresque/pytorch-template

I hope that helps.

Let me know if you want me to try something. I am basically stuck right now, because I can’t train the model without cuDNN, since it then doesn’t fit into memory.

Thanks a lot,
Christian

@ptrblck I just tried running it with cuDNN enabled, but this time using only 1 GPU, and it doesn’t crash.

Here some more system information. Let me know if I should test something.

Driver Version: 410.48
NAME="Red Hat Enterprise Linux Server"
VERSION="7.6 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.6"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.6:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.6"
Red Hat Enterprise Linux Server release 7.6 (Maipo)

Sometimes CUDNN_STATUS_EXECUTION_FAILED can result when there is a mismatch between the bare-metal CUDA toolkit version and the version used to compile your PyTorch binaries. Can you list the following:

$ nvcc --version
$ python
 >>> import torch
 >>> torch.version.cuda # this tells you the version of CUDA that was used to compile your PyTorch binary
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
>>> import torch
>>> torch.version.cuda
'10.0.130'

Is your code running fine now using a single GPU?
If so, could you post the batch size and input shape you are currently using?

I’ve just got feedback that the 3D dataset in this course eats up nearly all the memory on 4 V100s.
Could an OOM error thus still be the cause of this issue in your use case?

@ptrblck yes, it’s running on one GPU (the Titan RTX) with mixed precision.

@ptrblck I think it runs for a few iterations with multiple GPUs, but then crashes at the end of the first epoch. At the end of the epoch it seems to allocate more memory and then crashes. I also tried this with one GPU and cuDNN enabled, and the same thing happens at about the same position.

Is there anything done in the background that I am not aware of?

Do you see the increased memory usage still in the first epoch, after the validation of the first epoch, or at the beginning of the second epoch?
I cannot see anything strange in the code, so if almost all your memory is already used, this might be due to memory fragmentation. Could you try to remove the instantiation of the optimizer inside run_training, initialize it once instead, and pass it to run_training?
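
Roughly like this (just a sketch with placeholder names, since your run_training code isn’t public):

    import torch
    from torch import nn, optim

    def run_training(model, optimizer, loader):
        # uses the optimizer that was passed in instead of creating a new one here
        for data, target in loader:
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(data), target)
            loss.backward()
            optimizer.step()

    model = nn.Linear(10, 1)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)  # created once, outside the epoch loop
    loader = [(torch.rand(4, 10), torch.rand(4, 1))]     # placeholder data

    for epoch in range(2):
        run_training(model, optimizer, loader)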

Memory is almost completely allocated, and then it happens at the end of the epoch, without even getting to the validation step.

Train Epoch: 1 [102/111 (91%)] Loss: 209794.437500
Train Epoch: 1 [104/111 (93%)] Loss: 188173.593750
Train Epoch: 1 [106/111 (95%)] Loss: 205403.281250
Train Epoch: 1 [108/111 (96%)] Loss: 504861.593750
Traceback (most recent call last):
  File "train.py", line 92, in <module>
    main(config, args.resume)
  File "train.py", line 61, in main
    trainer.train()
  File "source/base/base_trainer.py", line 91, in train
    result = self._train_epoch(epoch)
  File "source/trainer/trainer.py", line 77, in _train_epoch
    scaled_loss.backward()
  File "/usr/local/lib64/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I am not using the code from that NVIDIA workshop; I only adapted their V-Net model for testing. This is the template I am using: https://github.com/victoresque/pytorch-template