I’m facing a `CUDNN_STATUS_INTERNAL_ERROR` on machines that have a Quadro RTX 8000 GPU.
This happens only when I set `cudnn.benchmark = True`.
I do not see the same error on other machines with identical Python/PyTorch/CUDA/cuDNN versions installed.
I was able to reproduce the error with the minimal script below:
```python
import time
import torch
import torch.utils.data as data_utils
import torchvision.models as models
from torch import nn, optim
from torch.backends import cudnn
from torch.nn import functional as F

cudnn.enabled = True
cudnn.benchmark = True

device = torch.device("cuda")
torch.cuda.set_device(2)

batch_size = 1
img_size = 1024
N = 320
lr = 0.001
channels = 3

# Random data at the resolution that triggers the error
train_data = torch.randn(N, channels, img_size, img_size)
train_labels = torch.ones(N).long()
train = data_utils.TensorDataset(train_data, train_labels)
train_loader = data_utils.DataLoader(train, batch_size=batch_size,
                                     shuffle=True, pin_memory=True)

criterion = nn.CrossEntropyLoss().cuda()
model = models.densenet161().cuda()
# model = models.resnet18().cuda()
model.train()
optimizer = optim.Adam(model.parameters(), lr=lr)

for x, y in train_loader:
    x, y = x.to('cuda', non_blocking=True), y.to('cuda', non_blocking=True)
    pred = model(x)
    print('forward done')
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Some observations from running the above code on the machines that produce the error:
- Decreasing `img_size` to 512 makes the error go away
- Switching the model to `resnet18` makes the error go away
- Changing the batch size does not seem to have any effect
I would appreciate any help or information to help debug the source of the problem.
Also, I understand from other posts that `cudnn.benchmark` searches for the fastest convolution implementation for the particular hardware and input size, and then reuses it throughout training. However, I could not find a good explanation of what `cudnn.enabled` does. Even with `cudnn.benchmark` set to False, I still get a performance boost just by setting `cudnn.enabled` to True. How does that help?
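For context, this is the kind of micro-benchmark I used to see the speedup from the two flags; the layer shape and iteration counts are arbitrary choices of mine, and the timings will obviously differ per machine:

```python
import time
import torch
from torch import nn
from torch.backends import cudnn

def time_conv(enabled, benchmark, iters=20, device="cuda"):
    # Toggle the cuDNN flags before building/running the layer.
    cudnn.enabled = enabled
    cudnn.benchmark = benchmark
    conv = nn.Conv2d(3, 64, kernel_size=7, padding=3).to(device)
    x = torch.randn(8, 3, 256, 256, device=device)
    # Warm-up, so benchmark-mode autotuning is not counted in the timing.
    for _ in range(3):
        conv(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        conv(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

if torch.cuda.is_available():
    print("cudnn off:           ", time_conv(enabled=False, benchmark=False))
    print("cudnn on:            ", time_conv(enabled=True, benchmark=False))
    print("cudnn on + benchmark:", time_conv(enabled=True, benchmark=True))
```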