cuDNN error: CUDNN_STATUS_EXECUTION_FAILED after two epochs

Hi everyone,

I’ve been reading a lot of posts here recently, but none of them helped, so I decided to post my own problem.

I am trying to work with a ResNet50 model and I wrote some scripts that used to work on my CPU. I bought an NVIDIA A40 to speed up those scripts, but I’m not able to run the training anymore using the GPU.

Training works for two epochs (at most) and it gives:


right at the ‘loss.backward()’ call. I got this error when running my code with ‘CUDA_LAUNCH_BLOCKING’ = 1. Before I set that flag, the error used to be:


Furthermore, after the error if I open a terminal, the command ‘nvidia-smi’ gives me:

‘Unable to determine the device handle for GPU0000:86:00.0: Unknown Error’

Sorry for not providing minimal code to reproduce the errors, but I haven’t figured out how to do it since 95% of the script is about reading my personal dataset. I’ve tried lowering my batch_size, but I keep getting the same error.

I can provide you the output of python3 -m torch.utils.collect_env:

Collecting environment information…
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.10.0-21-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.66
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[conda] Could not collect

And my ‘nvidia-smi’ output (before running the code and getting my error):

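+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                      On | 00000000:86:00.0 Off |                    0 |
|  0%   39C    P8               23W / 300W|      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1342      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+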

Any help would be really appreciated. Thanks in advance.

PS: I don’t work with Anaconda at all and I’m running my script via Jupyter Notebook.

Was this setup working before and if so, what changed?

Could you check dmesg and check for Xids in it? I’ve recently helped debug a similar issue where the GPU was overheating as it wasn’t cooled at all and dropped off the bus (see this thread for more details).
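
To test the overheating theory, it can help to log the GPU temperature and power draw while training is running. Here is a minimal sketch, assuming `nvidia-smi` is on the PATH. The helper names `parse_gpu_stats` and `query_gpu_stats` and the sample line are made up for illustration and are not taken from the machine in this thread.

```python
import subprocess

# Fields requested from nvidia-smi (temperature in C, power in W, performance state)
QUERY_FIELDS = "temperature.gpu,power.draw,pstate"


def parse_gpu_stats(csv_line):
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    temp, power, pstate = [field.strip() for field in csv_line.split(",")]
    return {"temp_c": int(temp), "power_w": float(power), "pstate": pstate}


def query_gpu_stats(gpu_index=0):
    """Ask nvidia-smi for the current stats of one GPU (raises if nvidia-smi fails)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout.strip().splitlines()[0])


# Hypothetical sample line, only to show the parsing step without needing a GPU
sample_stats = parse_gpu_stats("39, 23.45, P8")
print(sample_stats)
```

Calling `query_gpu_stats()` once per epoch (or every few batches) and printing the result would show whether the temperature keeps climbing right up to the crash. If a field comes back as `[N/A]` on a given setup, the `int`/`float` conversion above would need a guard.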

First of all, thanks for replying :slight_smile:

My scripts were running perfectly on that PC before I installed the GPU. What changed was basically that in my ‘train’ function now I perform


in order to run on the GPU instead of the CPU.

I did and couldn’t find it. The command ‘sudo dmesg | grep -i Xids’ doesn’t give an output.

Remove the s and run dmesg | grep -i xid.
Also, could you check if updating to PyTorch 2.0.0 and the CUDA 11.8 runtime might solve the issue?
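
If grepping by hand is inconvenient (for example from inside the notebook), the same search can be sketched in Python. This assumes the kernel log text has already been captured, e.g. from `sudo dmesg`. The function name `find_xid_events` and the sample log are hypothetical and only show what a match would look like; they are not output from this machine.

```python
import re

# Matches NVIDIA driver lines in the kernel log such as:
#   NVRM: Xid (PCI:0000:86:00): 79, pid=1342, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xid_events(log_text):
    """Return a list of (device, xid_code, detail) tuples found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            detail = (match.group("detail") or "").strip()
            events.append((match.group("device"), int(match.group("code")), detail))
    return events


# Hypothetical log excerpt for illustration only
sample_log = """\
[  100.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[ 5021.123456] NVRM: Xid (PCI:0000:86:00): 79, pid=1342, GPU has fallen off the bus.
"""

events = find_xid_events(sample_log)
print(events)
```

An empty list means no Xid lines were found in the text that was passed in, which matches the "no output" result from grep.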

Same, no output.

I’m working on it, will write here when done. Thanks so much for your attention :slight_smile:

Just finished and I get the same error,

File ~/.local/lib/python3.9/site-packages/torch/nn/modules/, in Conv2d._conv_forward(self, input, weight, bias)
    455 if self.padding_mode != 'zeros':
    456     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    457                     weight, bias, self.stride,
    458                     _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
    460                 self.padding, self.dilation, self.groups)


According to the NVIDIA forum posts, the cause could be that the GPU isn’t seated correctly in the PCIe slot, and moving it to another PCIe slot might rectify the problem. Doesn’t hurt to check.

Thanks for the check! Could you post a minimal and executable code snippet reproducing the issue so that I could try to debug it on an A40?

I really want to, but I don’t know how to do it without giving you the data I’m using. Any idea?

Could you try to use random input data as I doubt the error depends on the actual values in the inputs?

Okay I’ll try it, will come later when I have it. Thanks :slight_smile:

Hi, thanks for your reply :slight_smile: I checked and still got the same error.

Hi :slight_smile: I could reproduce the issue with the following code (sorry if it is not so ‘minimal’, but at least it is executable)

import os
import shutil
import torch
import pickle
import time
import random
import numpy as np

from torchvision.models import resnet50, ResNet50_Weights
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms, datasets
from torch import nn, optim

class RandomDataset(Dataset):
    def __init__(self, size):
        self.imgs = [torch.rand(3,224,224) for _ in range(size)]
        self.labels = [random.randint(0,1) for _ in range(size)]
        self.norm = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
        self.classes = [0,1]
    def __len__(self):
        return len(self.imgs)
    def __getitem__(self, idx):
        img = self.norm(self.imgs[idx])
        label = self.labels[idx]
        return img, label

def train(n_epochs, dataloaders, model, optimizer, criterion, use_cuda):
    classes = dataloaders['train'].dataset.classes
    n_classes = len(classes)
    if use_cuda:
      model = model.cuda()

    loss_dict = {}
    loss_dict['train'], loss_dict['valid'], loss_dict['valid_acc'] = [], [], []
    valid_loss_min = np.inf
    prev_save = ""
    print("criterion: {}".format(criterion))

    for e in range(1, n_epochs + 1):
      start = time.time()
      train_loss, valid_loss, n_corr = 0., 0., 0
      #################  TRAIN THE MODEL  #################
      for data, target in dataloaders['train']:
        if use_cuda:
          data = data.cuda() 
          target = target.cuda()   # shape: [batch_size]

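        optimizer.zero_grad()   # clear gradients from the previous step
        output = model(data)    # shape: [batch_size, n_classes]
        loss = criterion(output, target)
        loss.backward()         # the error is raised here
        optimizer.step()
        train_loss += loss.item()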
      #################  VALIDATE THE MODEL  #################
      for data, target in dataloaders['valid']:
        if use_cuda:
          data = data.cuda() 
          target = target.cuda()   # shape: [batch_size]

        output = model(data)    # [batch_size, n_classes]  
        loss = criterion(output, target)  
        valid_loss += loss.item()
        output = output.cpu().detach().numpy()
        n_corr += int(sum([np.argmax(pred)==target[i] for i, pred in enumerate(output)]))

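      train_loss = train_loss / len(dataloaders['train'].dataset)
      valid_loss = valid_loss / len(dataloaders['valid'].dataset)
      valid_acc = n_corr/len(dataloaders['valid'].dataset)

      ##  Record this epoch's metrics so loss_dict is not saved empty
      loss_dict['train'].append(train_loss)
      loss_dict['valid'].append(valid_loss)
      loss_dict['valid_acc'].append(valid_acc)
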
      ##  Log result each epoch
      print('Epoch: %d/%d\t Train Loss: %.5f\t Valid Loss: %.5f\t Valid Acc: %.4f\t elapsed time: %.1fs'%(e, n_epochs, train_loss, valid_loss, valid_acc, time.time()-start))

      ##  Save model if the current validation loss is lower than the previous validation loss
      if valid_loss < valid_loss_min:
        if prev_save:
          os.remove("model" + prev_save + ".pt")
          os.remove("loss_dict" + prev_save + ".pkl")
        prev_save = "_" + str(e)
        torch.save(model.state_dict(), "model" + prev_save + ".pt")
        pickle.dump(loss_dict, open("loss_dict" + prev_save + ".pkl", "wb"))
        valid_loss_min = valid_loss
    return loss_dict, model

# It takes a few seconds to generate random data
train_data = RandomDataset(13052)
val_data = RandomDataset(36)

bs = 16
train_dl = DataLoader(train_data, batch_size=bs, shuffle=True)
val_dl = DataLoader(val_data, batch_size=bs, shuffle=True)

dataloaders = {'train': train_dl, 'valid': val_dl}
# Import resnet50 with pretrained weights, and modify FC layer
myModel = resnet50(weights=ResNet50_Weights.DEFAULT)
mlp = [nn.Linear(in_features=2048, out_features=1024, bias=True), nn.Dropout(0.5), nn.PReLU(), 
       nn.Linear(in_features=1024, out_features=512, bias=True), nn.Dropout(0.5), nn.PReLU(), 
       nn.Linear(in_features=512, out_features=2, bias=True), nn.Dropout(0.5), nn.PReLU()]
mlp = nn.Sequential(*mlp)
myModel.fc = mlp

# Define loss function and different learning rates for each part of the model

criterion = nn.CrossEntropyLoss()

encoder = []
decoder = []
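for name, param in myModel.named_parameters():
  if 'fc' in name:
    decoder.append(param)   # new MLP head
  else:
    encoder.append(param)   # pretrained ResNet50 backbone
L_RATE1 = 5e-4   # learning rate for the encoder
L_RATE2 = 8e-3   # learning rate for the decoder
# Set the learning rate per parameter group, so no shared default is needed
optimizer = torch.optim.SGD([{'params': encoder, 'lr': L_RATE1}, {'params': decoder, 'lr': L_RATE2}], momentum=0.95)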
use_cuda = torch.cuda.is_available() # Check if GPU is detected
# Train the model
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
n_epochs = 100
print('GPU:', use_cuda)
inicio = time.time()   
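loss_dict, model = train(n_epochs=n_epochs, dataloaders=dataloaders, model=myModel, optimizer=optimizer, criterion=criterion, use_cuda=use_cuda)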
final = time.time()
print(f'Training time: {(final - inicio)/60} min.')

This is exactly how my scripts work except for the data part. Hope it helps, thanks in advance :wink:
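
One note on the script above: it sets ‘CUDA_LAUNCH_BLOCKING’ after torch has been imported and after torch.cuda.is_available() has been called. As far as I understand, the variable is read when CUDA is initialized, so in a long-lived Jupyter kernel it is safer to set it in the very first cell, before importing torch (or in the shell that starts Jupyter). A minimal sketch of that ordering check follows; the helper name `set_launch_blocking` is made up for illustration.

```python
import os
import sys


def set_launch_blocking():
    """Set CUDA_LAUNCH_BLOCKING=1 and report whether torch was already imported.

    Returns True if torch is already loaded in this process, in which case the
    flag may have been set too late to be picked up (assumption: it is only
    read when CUDA is initialized).
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return torch_already_imported


if set_launch_blocking():
    print("torch was imported before the flag was set; restart the kernel to be sure it applies")

# Only import torch after this point, e.g.:
# import torch
```

Setting the variable this way keeps the stack trace pointing at the operation that actually failed, which is what the flag is for in the first place.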