CUDNN_STATUS_INTERNAL_ERROR for large image sizes on RTX GPU

I’m facing a CUDNN_STATUS_INTERNAL_ERROR on machines that have a Quadro RTX 8000 GPU.
This happens only when I set cudnn.benchmark to True.
I do not face the error on other machines that have the same Python/PyTorch/CUDA/cuDNN versions installed.

I was able to reproduce the error with the minimal script below:

import time
import torch
from torch import nn, optim
import torch.utils.data as data_utils
import torchvision.models as models
from torch.backends import cudnn
from torch.nn import functional as F

cudnn.enabled = True     # use cuDNN
cudnn.benchmark = True   # let cuDNN search for the fastest algorithms for the fixed input size
device = torch.device("cuda")
torch.cuda.set_device(2)  # GPU index on this machine

batch_size = 1
img_size = 1024
N = 320
lr = 0.001
channels = 3

train_data = torch.randn(N, channels, img_size, img_size)
train_labels = torch.ones(N).long()

train = data_utils.TensorDataset(train_data, train_labels)
train_loader = data_utils.DataLoader(train, batch_size=batch_size, shuffle=True, pin_memory=True)

criterion = nn.CrossEntropyLoss().cuda()

model = models.densenet161().cuda()
#model = models.resnet18().cuda()

model.train()
optimizer = optim.Adam(model.parameters(), lr=lr)
for x, y in train_loader:
    x, y = x.to('cuda', non_blocking=True), y.to('cuda', non_blocking=True)
    pred = model(x)
    print('forward done')

    loss = criterion(pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Some observations while running the above code on the machines that give the error:

  1. When I decrease img_size to 512, I do not face the error.
  2. When I change the model to ResNet-18, I do not face the error (a short sketch for bisecting these two factors is given after this list).
  3. Changing the batch size does not seem to have any effect.
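
For reference, this is the rough bisection sketch I mentioned; the model/size grid is illustrative rather than exhaustive, and it just runs one forward/backward pass per combination and reports whether it survives:

import torch
import torchvision.models as models
from torch.backends import cudnn

cudnn.enabled = True
cudnn.benchmark = True
device = torch.device("cuda")

for model_fn in (models.densenet161, models.resnet18):
    for img_size in (512, 768, 1024):
        model = model_fn().to(device)
        x = torch.randn(1, 3, img_size, img_size, device=device)
        try:
            # one forward/backward pass is enough to trigger the cuDNN algorithm selection
            model(x).sum().backward()
            print(f"{model_fn.__name__} @ {img_size}: ok")
        except RuntimeError as err:
            print(f"{model_fn.__name__} @ {img_size}: {err}")
        del model, x
        torch.cuda.empty_cache()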

I would appreciate any pointers that could help me debug the source of the problem.

Library versions:
Python 3.7.5
PyTorch 1.3.1
CUDA 10.1
cuDNN 7.6.3
Ubuntu 18.04

Also, I understand from other posts that cudnn.benchmark looks for the fastest algorithm for the particular hardware and input size and then uses it throughout training. However, I could not find a good explanation of what cudnn.enabled does. I see that even with cudnn.benchmark set to False, I get a performance boost just by setting cudnn.enabled to True. How does it help?

torch.backends.cudnn.enabled = True makes PyTorch use cuDNN, while torch.backends.cudnn.enabled = False disables it and falls back to the native PyTorch implementations.
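
As a rough illustration (not taken from your script; the layer shape and iteration counts are arbitrary), you can see the effect of the two flags by timing a single convolution under the different settings:

import time
import torch
from torch import nn
from torch.backends import cudnn

x = torch.randn(8, 64, 224, 224, device="cuda")
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()

for enabled, benchmark in [(False, False), (True, False), (True, True)]:
    cudnn.enabled = enabled
    cudnn.benchmark = benchmark
    for _ in range(5):        # warm-up; lets benchmark mode pick its algorithm before timing
        conv(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(50):
        conv(x)
    torch.cuda.synchronize()
    print(f"enabled={enabled}, benchmark={benchmark}: {time.time() - t0:.3f}s")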

cudnn.benchmark = True could fail to find a fast kernel, but it should fall back to a (slower but) working algorithm, which apparently is not happening for your workload.
Could you update PyTorch to the latest stable version (1.5) or the nightly binaries? We’ve recently implemented another fallback mechanism, which should raise a proper error message.
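
As a quick sanity check of the environment (just a convenience snippet, not something the fallback depends on), you can print the versions your binary was actually built with:

import torch

print(torch.__version__)                # PyTorch version, e.g. 1.3.1
print(torch.version.cuda)               # CUDA version the binary was built against
print(torch.backends.cudnn.version())   # cuDNN version, e.g. 7603
print(torch.cuda.get_device_name(0))    # GPU name, e.g. Quadro RTX 8000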
