Seeking help for a cuDNN error

Hello,

My model raised a cuDNN error on a convolution, as described below. As a workaround I am using slightly fewer filters, so this is not urgent for me, but the underlying problem is still there. Please let me know if you have a solution.

Thomas.

Environment:
OS: Ubuntu 18.04.6 LTS
GPU: RTX 3090
Driver Version: 495.29.05
Running PyTorch 1.10.0 in a Docker container based on the image pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime from Docker Hub.

Note: When I run nvidia-smi, it reports CUDA Version: 11.5, while the Docker image name suggests 11.3.
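
For reference, nvidia-smi shows the highest CUDA version the installed driver supports, while the image tag names the CUDA toolkit the PyTorch binaries were built against, so the two numbers can differ. Below is a minimal sketch for printing what PyTorch itself reports inside the container. The helper name `collect_versions` is made up for illustration; it returns empty values if PyTorch is not installed.

```python
import importlib.util


def collect_versions():
    """Return version info for the installed PyTorch build, if any."""
    info = {"torch": None, "cuda_build": None, "cudnn": None, "gpu": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment
    import torch

    info["torch"] = torch.__version__
    # CUDA toolkit version the binaries were compiled with (None on CPU builds)
    info["cuda_build"] = torch.version.cuda
    if torch.backends.cudnn.is_available():
        info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```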

Code snippet to reproduce the error (raised during the backward pass)

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 550, 1, 162001], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(550, 900, kernel_size=[1, 6], padding=[0, 0], stride=[1, 1], dilation=[1, 32400], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

Copy of the error from the code snippet

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 550, 1, 162001], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(550, 900, kernel_size=[1, 6], padding=[0, 0], stride=[1, 1], dilation=[1, 32400], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 32400, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7f5f2c0146c0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 550, 1, 162001, 
    strideA = 89100550, 162001, 162001, 1, 
output: TensorDescriptor 0x7f5f2c0141a0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 900, 1, 1, 
    strideA = 900, 1, 1, 1, 
weight: FilterDescriptor 0x7f5f2c029eb0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 900, 550, 1, 6, 
Pointer addresses: 
    input: 0x7f5f30000000
    output: 0x7f5f5b600800
    weight: 0x7f5f5ca00000
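
As a sanity check on the dump above, the shapes can be reproduced with plain arithmetic. This is only a sketch using the output-size formula from the Conv2d documentation; the helper name `conv_output_size` is made up for illustration. With a dilation of 32400 and a kernel width of 6, the kernel spans 32400 * 5 + 1 = 162001 elements, exactly the input width, which is why the output width is 1.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one dimension, per the Conv2d shape formula."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


in_channels, height, width = 550, 1, 162001

out_h = conv_output_size(height, kernel=1, dilation=1)
out_w = conv_output_size(width, kernel=6, dilation=32400)

# Stride of the batch dimension for a contiguous NCHW input
batch_stride = in_channels * height * width

print(out_h, out_w, batch_stride)  # prints: 1 1 89100550
```

The values match the `dimA = 1, 900, 1, 1` output descriptor and the first entry of the input `strideA`.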

What I tried

I tried disabling cuDNN with torch.backends.cudnn.enabled = False; it raises a different error, and earlier:

>>> import torch
>>> torch.backends.cuda.matmul.allow_tf32 = True
>>> torch.backends.cudnn.benchmark = True
>>> torch.backends.cudnn.deterministic = False
>>> torch.backends.cudnn.allow_tf32 = True
>>> torch.backends.cudnn.enabled=False
>>> data = torch.randn([1, 550, 1, 162001], dtype=torch.half, device='cuda', requires_grad=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
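
A minimal sketch of following the hint in that message, assuming a fresh Python process: the variable has to be in the environment before CUDA is initialized, so it is set before importing torch here (exporting it in the shell before starting Python should work equally well).

```python
import os

# Make kernel launches synchronous so that errors are reported at the call
# that caused them. Set this before CUDA is initialized, i.e. before
# importing torch in a fresh process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # prints: 1
```

After this, importing torch and re-running the repro snippet in the same process should give a stack trace that points at the failing call.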