Linux RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

paozaf · April 27, 2021, 10:03am

Hi all,
I’m writing a GAN for image translation.

I’m using a U-Net-like network as the generator and a ResNet (from torch hub) as the discriminator (modified to output a single class with a sigmoid).

I train the models by alternating minibatches (discriminator, generator, discriminator, and so on).
The first discriminator minibatch runs, the first generator minibatch runs, and the second discriminator minibatch runs too, but when it switches to the generator for the second time I get this:

Traceback (most recent call last):
  File "/home/paolo/cycleGAN/net_CT/35/…/…/cycleGAN/train.py", line 704, in <module>
    run_training(models, trainCases, epoch, lp, max_image_shape, device)
  File "/home/paolo/cycleGAN/net_CT/35/…/…/cycleGAN/train.py", line 442, in run_training
    loss_G.backward()
  File "/home/paolo/cycleGAN/venv_nightly/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/paolo/cycleGAN/venv_nightly/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

My environment is an Arch Linux box with nvcc 11.2, Python 3.9.1, and cuDNN 8.0.5 (cuDNN 6.0 is also installed).
I get the same error with both the nightly and the stable build (pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html).

I think the models and the metrics are ok since a few minibatches run.

Thank you.

paozaf · April 27, 2021, 1:08pm

If I switch to the CPU it works, which may be useful information.

ptrblck · April 28, 2021, 5:37am

Could you post a minimal code snippet to reproduce this issue (model definition and the input shapes would be enough) as well as the output of python -m torch.utils.collect_env, please?
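
A few of the fields in the collect_env report can also be read directly from Python, which is handy for pasting alongside a bug report. The following is a minimal sketch, not a replacement for the full report. The helper name summarize_env is made up for illustration, and it returns an empty dict when PyTorch cannot be imported.

```python
def summarize_env():
    """Return a small dict of version info, or {} if torch is not installed."""
    try:
        import torch
    except ImportError:
        return {}

    info = {
        "torch": torch.__version__,
        # CUDA runtime the binary was built with (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        # cuDNN version that PyTorch reports (None if cuDNN is unavailable)
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
        "gpus": [],
    }
    if info["cuda_available"]:
        for i in range(torch.cuda.device_count()):
            # (device name, (major, minor) compute capability)
            info["gpus"].append(
                (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
            )
    return info


if __name__ == "__main__":
    for key, value in summarize_env().items():
        print(f"{key}: {value}")
```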

paozaf · April 28, 2021, 10:19am

Hi @ptrblck, thanks for helping.

Unfortunately, I do not have a self-contained, runnable snippet (data extraction and preprocessing are a key step in my application), but I can provide a partial view of the training function (some declarations and variables are omitted, and in places it reads like pseudocode).
Here is the snippet:

def run_training(models, device, …):

    # Set optimizers
    optimizerG = optim.Adam(models["generator"].parameters(), lr=lr_G, weight_decay=lp["lambda"])
    optimizerD = optim.Adam(models["discriminator"].parameters(), lr=lr_D, weight_decay=lp["lambda"])

    # Define adversarial loss
    adversial_loss = nn.BCELoss().to(device)

    # mini batch stuff
    g_minibatch_counter = 0
    d_minibatch_counter = 0
    minibatch_size = 20
    minibatch_status = "discriminator"

    for i, index_shuf in enumerate(indexes_shuf):

        # Assign images to batch and run only if the selected view matches
        if augmented_X is not None and augmented_Y is not None:
            batchIndex += 1

            # If batch is full, start training
            if batchIndex == lp["batchSize"]:

                # Define labels for discriminator training
                real_label = torch.full((lp["batchSize"], 1), randrange(7, 12, 1)*0.1, dtype=torch.float, device=device)
                fake_label = torch.full((lp["batchSize"], 1), randrange(0, 3, 1)*0.1, dtype=torch.float, device=device)

                # Convert numpy object to pytorch tensor
                Y_batch_tensor, Xes_batch_tensor = convert_arrays_to_tensors(Y_batch, Xes_batch)

                # Run conversion
                Y_pred = models["generator"](Xes_batch_tensor)

                #####################
                # Train discriminator
                #####################
                if minibatch_status == "discriminator":
                    models["generator"].eval()
                    models["discriminator"].train()

                    # Set the discriminator gradients to zero
                    models["discriminator"].zero_grad()

                    if d_minibatch_counter <= int(minibatch_size/2):
                        # real
                        score_d_real = models["discriminator"](Xes_batch_tensor)
                        loss_D_real = adversial_loss(score_d_real, real_label)
                        loss_D_real.backward()
                        loss_D = loss_D_real
                    else:
                        # fake
                        score_d_fake = models["discriminator"](Y_pred.detach())
                        loss_D_fake = adversial_loss(score_d_fake, fake_label)
                        loss_D_fake.backward()
                        loss_D = loss_D_fake

                    # Update minibatch info
                    d_minibatch_counter += 1
                    if d_minibatch_counter == minibatch_size:
                        optimizerD.step()
                        minibatch_status = "generator"
                        d_minibatch_counter = 0

                #####################
                # Train generator
                #####################
                if minibatch_status == "generator":

                    models["generator"].train()
                    models["discriminator"].eval()

                    # Set the generator gradients to zero
                    models["generator"].zero_grad()

                    # Compute generative intensity loss
                    loss_G_intensity, loss_G_type_str = compute_loss(lp, "train", Y_batch_tensor, Y_pred)

                    # Compute l1 penalty
                    loss_G_norm = compute_l1_norm(models["generator"], lp["lambda"])

                    # Compute losses
                    score_d_g = models["discriminator"](Y_pred)
                    loss_D_G = adversial_loss(score_d_g, real_label)

                    loss_G = loss_G_intensity + loss_G_norm + (lp["discriminatorWeight"] * loss_D_G)

                    # Run optimizer
                    loss_G.backward()
                    optimizerG.step()

                    # Update minibatch info
                    g_minibatch_counter += 1
                    if g_minibatch_counter == minibatch_size:
                        minibatch_status = "discriminator"
                        g_minibatch_counter = 0

                # Reset batch index counter
                batchIndex = 0

models = {}
models["generator"] = models_arch.Generator(lp).to(device)
models["generator"].apply(weight_init.weight_init)

models["discriminator"] = torch.hub.load('pytorch/vision', 'resnet152', pretrained=False)
models["discriminator"].conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
models["discriminator"].fc = nn.Sequential(
    nn.Linear(models["discriminator"].fc.in_features, 512),
    nn.ReLU(),
    nn.Dropout(),
    nn.Linear(512, 1),
    nn.Sigmoid(),
)
models["discriminator"] = models["discriminator"].to(device)

run_training(…)

It runs on the CPU, so I don’t think the problem is a shape mismatch or anything similar.

python -m torch.utils.collect_env returns the following:

Collecting environment information…
PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 10.2.0
Clang version: Could not collect
CMake version: version 3.19.3

Python version: 3.9 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN Xp
GPU 1: Quadro P6000

Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/opt/cudnn6/lib64/libcudnn.so.6.0.21
/usr/lib/libcudnn.so.8.0.5
/usr/lib/libcudnn_adv_infer.so.8.0.5
/usr/lib/libcudnn_adv_train.so.8.0.5
/usr/lib/libcudnn_cnn_infer.so.8.0.5
/usr/lib/libcudnn_cnn_train.so.8.0.5
/usr/lib/libcudnn_ops_infer.so.8.0.5
/usr/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.8.1+cu111
[pip3] torchaudio==0.8.1
[pip3] torchvision==0.9.1+cu111
[conda] Could not collect

Thank you!

ptrblck · April 28, 2021, 9:11pm

Thanks! Could you post the missing definitions (of the model etc.) as well as the input shapes, so that we could reproduce and debug this issue?

paozaf · April 29, 2021, 11:21am

Hi @ptrblck ,
unfortunately, I cannot share the entire code (sorry, not my decision).
However, the generator is an encoder/decoder network (U-Net-like) that takes a tensor of shape [batch_size, 1, 256, 256] as input and returns an image of the same shape.

As I said, it runs smoothly on the CPU, so the shapes should match.
In case it helps, I also tried a VGG (instead of the ResNet) and got the same error.

Thanks a lot!

ptrblck · April 29, 2021, 9:45pm

I don’t think it’s a shape mismatch error, but an internal cudnn issue, which is why we would need to get a code snippet to reproduce it.
Also, did you install the pip wheels for 1.8.1? If so, could you create a new env and install the conda binaries?
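
One quick way to check whether a failure is tied to the cuDNN code path is to repeat the failing step with cuDNN disabled, so that PyTorch falls back to its native kernels. The sketch below does this for a stand-in convolution rather than the real models. The function name conv_step and the layer sizes are placeholders, and only the input shape [N, 1, 256, 256] comes from this thread.

```python
def conv_step(use_cudnn, device=None):
    """Run one conv forward/backward pass.

    Returns the weight-gradient shape, or None if torch is not installed.
    """
    try:
        import torch
        import torch.nn as nn
    except ImportError:
        return None

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # Placeholder layer: 1 input channel to match the thread's image shape
    conv = nn.Conv2d(1, 8, kernel_size=3, padding=1).to(device)
    x = torch.randn(2, 1, 256, 256, device=device)

    # flags() temporarily overrides the torch.backends.cudnn settings
    # and restores the previous values when the block exits
    with torch.backends.cudnn.flags(enabled=use_cudnn):
        loss = conv(x).mean()
        loss.backward()

    return tuple(conv.weight.grad.shape)


if __name__ == "__main__":
    for use_cudnn in (True, False):
        print(f"cudnn enabled={use_cudnn}: grad shape {conv_step(use_cudnn)}")
```

If a step fails on the GPU with use_cudnn=True but passes with use_cudnn=False, that suggests the cuDNN path rather than the model code, although it does not identify the cause.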

paozaf · April 30, 2021, 1:57pm

Hi @ptrblck
I extracted a snippet of code. It’s a GAN for image translation (the current example is meaningless since it uses fake inputs).
It works when I run it on the CPU but crashes when I switch to the GPU.

The toy code is this (I tried to paste it here but the indentation is not good):
https://pastebin.pl/view/551b716c

I’m using the wheels for 1.8.1, but my nvcc version is 11.2 (not 11.1).

Thanks a lot for helping, I really appreciate it.

ptrblck · April 30, 2021, 7:11pm

Thank you for the code! Could you upload it as a GitHub Gist, as pastebin is not accessible from the current setup (if not, I can use another machine later and check the code)?
In the meantime, could you also create a new virtual environment and install the conda binaries (not pip wheels), as I would like to check whether you are seeing an issue for sm_61 with the pip wheels in particular?
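
The GPU's compute capability and the architectures an installed binary was compiled for can be compared locally. The following is a rough sketch with hypothetical helper names. It uses an exact sm_XY match, which is stricter than CUDA requires (code built for a lower minor version of the same major architecture, or embedded PTX, can still run), and it only covers PyTorch's own kernels, not those inside the bundled cuDNN library.

```python
def capability_in_arch_list(capability, arch_list):
    """True if a (major, minor) compute capability has a matching sm_XY entry."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_current_device(index=0):
    """Compare one GPU with the build's arch list.

    Returns None if torch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return capability_in_arch_list(
        torch.cuda.get_device_capability(index),  # e.g. (6, 1) for sm_61
        torch.cuda.get_arch_list(),               # e.g. ['sm_60', 'sm_61', ...]
    )


if __name__ == "__main__":
    print("exact arch match for GPU 0:", check_current_device(0))
```

A False result is only a hint that the build may not ship kernels for that exact architecture.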

The pip wheels and conda binaries ship with their own CUDA runtime. Your local CUDA toolkit will only be used if you want to build PyTorch from source or build a custom CUDA extension. As long as the NVIDIA driver is suitable for the CUDA runtime shipped in the binaries, it should work.
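
The bundled runtime version is visible in the wheel's version string, so it can be compared with what nvcc reports without importing anything. This is a small sketch with a hypothetical helper name. It assumes the usual cuXYZ tag convention, where the last digit is the minor version. When PyTorch is importable, torch.version.cuda reports the same information directly.

```python
import re


def cuda_from_wheel_tag(version):
    """Map a version such as '1.8.1+cu111' to '11.1'.

    Returns None if the string carries no cuXYZ tag.
    """
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # Assumed convention: last digit is the minor version
    # (cu111 -> 11.1, cu102 -> 10.2)
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    # Version strings taken from the environment report in this thread
    for pkg_version in ("1.8.1+cu111", "0.9.1+cu111", "0.8.1"):
        print(pkg_version, "->", cuda_from_wheel_tag(pkg_version))
```

For the environment in this thread, the wheels bundle CUDA 11.1 while the local nvcc is 11.2. As described above, that difference should not matter as long as the driver supports the bundled runtime.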

paozaf · April 30, 2021, 11:16pm

@ptrblck here is the gist:

https://gist.github.com/pzaffino/7c3714ffe8eb867eb45b721ac4d2d808

pytorch_error.py

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models
import torch.nn.functional as F
 
class Generator(nn.Module):
    def __init__(self, initial_features, num_channels=1, dropout=0.1):
        super(Generator, self).__init__()

(Preview truncated; the full file is in the gist linked above.)

Installing conda will take some time, since I have to install it without affecting the system Python.

ptrblck · May 1, 2021, 1:04am

Thanks a lot for the great code! I’m able to reproduce the issue on a P6000 using the pip wheels, while the conda binaries work fine, so you are most likely hitting this issue.

paozaf · May 3, 2021, 9:04am

Thanks for the support @ptrblck !
I confirm that by using conda it works.

Paolo