Backward calculation fails with batch size > 1 when using cudnn, with error CUDNN_STATUS_INTERNAL_ERROR

Hi

I’ve been at a loss with this issue for some time now, and it’s blocking my research.

On a certain dataset I use, the loss.backward() call fails with the error below. It happens only when using cudnn, with a batch size > 1, and on NVIDIA RTX 20xx cards. On 1080 cards everything works fine; it also works when I use a different dataset, set the batch size to 1, or disable cudnn (a sketch of how I toggle this is below).
I’m using Ubuntu 20.04, CUDA 11.2, and cudnn 8.0.
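
For reference, this is roughly how I disable cudnn between runs; the Conv3d model and input shapes here are placeholders standing in for my actual setup:

import torch

# Toggling this flag off avoids the crash entirely in my runs;
# True together with batch size > 1 reproduces the error.
torch.backends.cudnn.enabled = False

net = torch.nn.Conv3d(4, 1, kernel_size=3, padding=1).cuda()  # placeholder model
data = torch.randn(2, 4, 5, 360, 640, device='cuda', requires_grad=True)  # batch size 2

out = net(data)
out.sum().backward()  # fails with CUDNN_STATUS_INTERNAL_ERROR when cudnn is enabled
torch.cuda.synchronize()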

I’ve seen similar issues in this forum, but without solutions.

Thanks for any help

  • Error log (with CUDA_LAUNCH_BLOCKING=1):
    , in train
    loss_sum.backward()
    File "/external/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
    File "/external/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
    RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
    You can try to repro this exception using the following code snippet. If that doesn’t trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 4, 5, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(4, 1, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 1]
stride = [1, 1, 1]
dilation = [1, 1, 1]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x55e8cdd6dc30
type = CUDNN_DATA_FLOAT
nbDims = 5
dimA = 2, 4, 5, 360, 640,
strideA = 4608000, 1152000, 230400, 640, 1,
output: TensorDescriptor 0x7f0ea8015430
type = CUDNN_DATA_FLOAT
nbDims = 5
dimA = 2, 1, 5, 360, 640,
strideA = 1152000, 1152000, 230400, 640, 1,
weight: FilterDescriptor 0x7f0ea80410d0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 5
dimA = 1, 4, 3, 3, 3,
Pointer addresses:
input: 0x7f0eb2000000
output: 0x7f0ed0c00000
weight: 0x7f0fb9bff800

I cannot reproduce this error on an RTX2080Ti using the posted code snippet with cudnn8.1 and CUDA11.2.
Could you post an executable code snippet that reproduces this error, as the suggested cudnn repro snippet isn’t able to trigger it?

Thanks for replying.

I cleaned up my environment, reinstalled and verified cudnn, and reinstalled torch using pip. The code snippet above still fails for me, as does the following code:

import torch
from torch.backends import cudnn
torch.backends.cudnn.benchmark = True
out = torch.randn([2, 4, 3, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(4, 1, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1]).cuda()
out = net(out)
out.sum().backward()

Setting torch.backends.cudnn.benchmark = False makes it succeed (a sketch of that variant follows the next snippet). Additionally, the following code, which involves no convolution, succeeds:

import torch
from torch.backends import cudnn
torch.backends.cudnn.benchmark = True
out = torch.randn([2, 4, 3, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
out.sum().backward()
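
And here is the failing snippet again with only the benchmark flag flipped; this version runs to completion on my machine:

import torch
torch.backends.cudnn.benchmark = False  # the only change vs. the failing snippet
data = torch.randn([2, 4, 3, 360, 640], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(4, 1, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1]).cuda()
out = net(data)
out.sum().backward()  # succeeds with benchmarking disabled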

I’ve seen similar issues reported in this forum and on GitHub.

Frankly, I’m not sure how to proceed with this, and setting benchmark=False globally is not the solution I’d want; the scoped stopgap I’ve been trying is sketched below.
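
In the meantime, I disable benchmarking only around the failing module via the torch.backends.cudnn.flags context manager (a minimal sketch, assuming the flags signature from the PyTorch version I have installed; the Conv3d is again a placeholder for my real model):

import torch

net = torch.nn.Conv3d(4, 1, kernel_size=3, padding=1).cuda()  # placeholder for the real model
data = torch.randn(2, 4, 3, 360, 640, device='cuda', requires_grad=True)

# Turn off benchmarking only inside this block; the global flag stays untouched.
with torch.backends.cudnn.flags(enabled=True, benchmark=False, deterministic=False, allow_tf32=True):
    out = net(data)
    out.sum().backward()  # succeeds for me with benchmarking disabled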

I still cannot reproduce the failure on the RTX2080Ti I’m using with the binaries for CUDA11.0 + cudnn8 and CUDA10.2 + cudnn7.
Which GPUs exactly are you using, and could you run the environment collection script (python -m torch.utils.collect_env) and post its output here as well?

/external/conda/bin/python /home/ronk/.config/JetBrains/PyCharmCE2020.3/scratches/scratch_22.py
Collecting environment information…
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2070 with Max-Q Design
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.7.1+cu110
[pip3] torchaudio==0.7.2
[pip3] torchcontrib==0.0.2
[pip3] torchvision==0.8.2+cu110
[conda] numpy 1.20.1 py38h18fd61f_0 conda-forge
[conda] torch 1.7.1+cu110 pypi_0 pypi
[conda] torchaudio 0.7.2 pypi_0 pypi
[conda] torchcontrib 0.0.2 pypi_0 pypi
[conda] torchvision 0.8.2+cu110 pypi_0 pypi

Referencing the corresponding bug: Pytorch autograd sometimes fails with CUDNN_STATUS_INTERNAL_ERROR · Issue #52263 · pytorch/pytorch · GitHub
As stated there, this should be fixed with torch 1.8 and cudnn 8.1.
I’m testing it and will update.
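
A quick way to confirm the upgraded environment once it’s installed, using the standard PyTorch version attributes:

import torch

print(torch.__version__)               # expect 1.8.x or newer
print(torch.version.cuda)              # CUDA version the binaries were built against
print(torch.backends.cudnn.version())  # expect >= 8100, i.e. cudnn 8.1.0
print(torch.backends.cudnn.is_available())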