Loss.backward() -> RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi all, I am trying to train a model, but I keep getting the error “RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED”. I’ve been trying to solve this problem for a week. The error occurs when this part of the code runs:

loss.backward()

and the full version of the error is:

Traceback (most recent call last):
  File "train.py", line 91, in <module>
    train()
  File "train.py", line 45, in train
    loss.backward()
  File "/home/User/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/User/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

My graphics card is an NVIDIA RTX 2060 (Mobile).
I am running Python 3.6.8.
torch and torchvision were installed following the official PyTorch installation guide.
I installed CUDA and cuDNN from NVIDIA’s official sources: CUDA 10.0 and cuDNN v7.6.1 (June 24, 2019) for CUDA 10.0.
nvcc -V output is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Paths defined in my .bashrc file:

export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda-10.0/lib64

nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   48C    P5    10W /  N/A |    497MiB /  5904MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1710      G   /usr/lib/xorg/Xorg                           229MiB |
|    0      1842      G   /usr/bin/gnome-shell                          88MiB |
|    0     12062      G   ...equest-channel-token=817002320788015348    50MiB |
|    0     12484      G   ...quest-channel-token=3040190697604709129    77MiB |
|    0     12845      G   ...quest-channel-token=2447417469316796923    49MiB |
+-----------------------------------------------------------------------------+

My Linux distro is Pop!_OS 18.04.

If you need more information, please tell me. Can anyone help me solve this?

Does your code run without cuDNN? Set torch.backends.cudnn.enabled = False and try running your code again.

If you see another error message, try to run your code using:

CUDA_LAUNCH_BLOCKING=1 python script.py

and post the stack trace here.
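
For reference, a minimal sketch of the cuDNN switch (nothing here is specific to your script):

import torch

# Fall back to PyTorch's native CUDA kernels instead of cuDNN;
# if the error then changes or disappears, the failing op is cuDNN-specific.
torch.backends.cudnn.enabled = False

CUDA_LAUNCH_BLOCKING=1 makes all kernel launches synchronous, so the stack trace points at the operation that actually failed rather than at a later, unrelated call.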


Thanks for your reply!

I tried the first suggestion (torch.backends.cudnn.enabled = False) and got this error message:

 Train Loss: 2.0851876735687256 1/342 Training Accuracy: 0.171875/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 91, in <module>
    train()
  File "train.py", line 45, in train
    loss.backward()
  File "/home/User/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/User/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:259

Then I tried the second one (CUDA_LAUNCH_BLOCKING=1), and the result is:

 Train Loss: 1.6407811641693115 8/342 Training Accuracy: 0.279296875/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=110 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 91, in <module>
    train()
  File "train.py", line 41, in train
    loss = criteration(out, labels)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 942, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 2056, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1871, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:110

BTW, the torch.backends.cudnn.enabled = False line was still in the code for the two runs above. For this third run, I removed that line and ran only with the second command; the output is:

Traceback (most recent call last):
  File "train.py", line 90, in <module>
    train()
  File "train.py", line 44, in train
    loss.backward()
  File "/home/User/.local/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/User/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Thanks for the information.
Could you check that all targets are in the range [0, nb_classes-1]?
The stack trace points to nll_loss, which is probably raising an out-of-bounds error.
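
For example, a quick sanity check along these lines (targets and nb_classes stand in for your own names):

# labels for nn.CrossEntropyLoss / nll_loss must lie in [0, nb_classes - 1]
assert targets.min() >= 0 and targets.max() < nb_classes, \
    f"targets out of range: min={targets.min().item()}, max={targets.max().item()}"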

If you run the code on the CPU, the error message should be clearer and, I would guess, point to the out-of-bounds target.


Well, there is no error when I remove the .cuda() parts; it seems to run properly on the CPU. What do you recommend? I’m fairly sure the index is not the problem, because I can run it on another laptop which has a GTX 1660. Same code, no error :confused:

Oh, now I got this error in the 34th batch :smiley:

 Train Loss: 1.5028196573257446 34/342 Training Accuracy: 0.36443014705882354Traceback (most recent call last):
  File "train.py", line 89, in <module>
    train()
  File "train.py", line 39, in train
    loss = criteration(out, labels)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 942, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 2056, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/User/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1871, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:92

Haha, alright, so batch 34 is apparently faulty. :smile:
I was wondering what might be going on in your code, but it seems to be the target issue.

Haha, alright then, I will install torch with Anaconda and try it again; maybe it’s because of pip :confused:

No, it’s unrelated to the installation.
Your target in batch 34 is not in the range [0, nb_classes-1].
Just add a print statement to your training loop and print out the min and max values of the targets.
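
A rough sketch of that check, which should also reveal which batch is faulty (train_loader and nb_classes are placeholder names):

for i, (inputs, targets) in enumerate(train_loader):
    lo, hi = targets.min().item(), targets.max().item()
    print(f"batch {i}: target range [{lo}, {hi}]")
    if lo < 0 or hi >= nb_classes:
        raise ValueError(f"batch {i} has targets outside [0, {nb_classes - 1}]")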


I will review the code in debug mode, thanks for your help. If the problem continues, I will write again :slight_smile:

Well, I rewrote the code and found the mistake. You were right about the number of classes; I fixed it. The code now works properly with the torch.backends.cudnn.enabled = False setting, and I trained 3 epochs with no errors. Then I stopped, because my graphics card started to make a noise like a DJ spinner, and since I’m not familiar with the hardware, I don’t know if something is going wrong. Do you think the noise could be harmful, or does it just come from the training process? Also, the first error is still the same when cuDNN is enabled.

That sounds weird.
How loud is this sound?
It’s quite normal that your GPU will make some sounds while it’s working.
Some people even claim to be able to distinguish between training a recurrent model and a CNN. :smile:
Was your GPU making these sounds from the beginning or did you just notice it now?

That’s not good. How did you install PyTorch?
Did you compile it from source (since you’ve posted the nvcc version) or did you install some binaries?

Could you create a new conda/pip environment and install the latest PyTorch binaries again?
Sometimes completely reinstalling conda seems to get rid of some bugs.


The sound started after the first epoch, so there is probably no problem there. I installed PyTorch with both conda and pip; both gave the same result. I also tried different virtual environments, reinstalled conda completely, and even reformatted my laptop. I’m still getting the same error. :frowning: Anything else I can try?

I’m also getting this same error on loss.backward(), and I’ve tried debugging by disabling cuDNN. When I have cuDNN disabled, the model runs, but when I have it enabled, it doesn’t. I’ve tried different ways of installing PyTorch with CUDA, but most recently I’m using Anaconda with cudatoolkit=10.1.

Is there a way to verify that cuDNN is working properly with my installation, beyond looking at the CUDA version number? The version number doesn’t seem to change despite changing my PATH variable.
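
For reference, PyTorch itself reports the CUDA/cuDNN versions it was built with and actually loads, independent of the PATH:

import torch

print(torch.version.cuda)                    # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())        # cuDNN version PyTorch actually loads
print(torch.backends.cudnn.is_available())   # whether cuDNN is usable at all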

Could you post a (small) code snippet to reproduce this issue, please?

import torch
from torch.utils.data import DataLoader

n_categories = 5
batch_size = 32
input_size = 2
hidden_size = 2
lstm_layers = 50

model = LSTMClassifier(n_categories, batch_size, input_size, hidden_size, lstm_layers, dropout=0.5).cuda()

weightSampler = torch.utils.data.WeightedRandomSampler(weights=bfp_data_setup.weights, num_samples=len(bfp_data_setup.weights), replacement=True)

train_dataloader = DataLoader(bfp_data_setup.dataset, batch_size=batch_size, sampler=weightSampler)

loss_fn = torch.nn.CrossEntropyLoss(size_average=False)  # size_average=False sums the loss over the batch (deprecated; reduction='sum' in newer versions)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for t in range(num_epochs):    
    for batch_padded_sequences, batch_packed_lengths, batch_packed_categories in train_dataloader:
        
        cuda_batch_padded_seq = batch_padded_sequences.cuda()
        cuda_batch_padded_lens = batch_packed_lengths.cuda()
        cuda_batch_packed_categories = batch_packed_categories.cuda()
        
        hidden = model.init_hidden(batch_packed_lengths.size(0))
        
        optimizer.zero_grad()
        model.zero_grad()
        
        decoded, output, hidden = model(cuda_batch_padded_seq, cuda_batch_padded_lens, hidden)
        
        loss = loss_fn(decoded, cuda_batch_packed_categories)
        
        epochLoss = loss.item()
        
        print("Epoch {}, Adam {}".format(t, epochLoss))

        loss.backward()
        
        optimizer.step()

Could you post the model definition of LSTMClassifier as well as the shapes of all used tensors, please?

The LSTMClassifier is currently defined as follows:

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, n_categories, batch_size, n_inp, n_hid, n_layers, dropout=0.5):
        super(LSTMClassifier, self).__init__()
        self.n_hid = n_hid
        self.n_layers = n_layers
        self.batch_size = batch_size
        self.n_categories = n_categories

        self.lstm = nn.LSTM(input_size=n_inp, hidden_size=n_hid, num_layers=n_layers, dropout=dropout, batch_first=True)
        self.hidden_to_category = nn.Linear(n_hid, n_categories)

        # note: nn.LSTM and nn.Linear already register their own parameters,
        # so these two register_parameter loops are redundant
        count = 0
        for param in self.lstm.parameters():
            self.register_parameter(name="lstm_{}".format(count), param=param)
            count += 1

        count = 0
        for param in self.hidden_to_category.parameters():
            self.register_parameter(name="hidden_{}".format(count), param=param)
            count += 1

        self.init_weights()

    def init_weights(self):
        initrange = 1
        self.hidden_to_category.bias.data.zero_().cuda()
        self.hidden_to_category.weight.data.uniform_(-initrange, initrange).cuda()

    def forward(self, input, lengths, hidden):
        batch_packed_sequences = nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=True, enforce_sorted=False)
        output, (hidden_last_t, _) = self.lstm(batch_packed_sequences, hidden)
        hidden_last_t_hidden_layer = hidden_last_t[-1, :]
        decoded = self.hidden_to_category(hidden_last_t_hidden_layer)
        return decoded, output, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        return (
            weight.new_zeros(self.n_layers, batch_size, self.n_hid),
            weight.new_zeros(self.n_layers, batch_size, self.n_hid)
        )

The previous version I posted didn’t feed the correct LSTM output to the following linear layer, given that the input is a packed sequence. Within the first minibatch, my inputs are the following:

cuda_batch_padded_seq shape=[32, 16958, 2]
cuda_batch_padded_len shape=[32]
hidden shape=([20, 32, 10], [20, 32, 10])
decoded shape=[32, 5]
cuda_batch_categories shape=[32]

Setting torch.backends.cudnn.enabled = False allows the model to proceed without any problems. I’ve changed the LSTM layers, hidden layers, and batch_size to different sizes, but regardless of what I change them to, I still get the same CUDNN_STATUS_EXECUTION_FAILED error. I’ve checked the memory usage with nvidia-smi, and it looks like I have plenty of room with and without cuDNN enabled.

Your help figuring out how I can train with cuDNN enabled would be greatly appreciated!

For anyone who is interested, I determined that the cuDNN error went away when I made my sequences significantly smaller. In this particular situation, I could not find the underlying error by disabling cuDNN. In general, using an LSTM on a really long sequence isn’t a good strategy anyway.
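
For illustration, truncation could look roughly like this, reusing the names from the training loop above (MAX_LEN is an arbitrary cap, not a value from this thread):

MAX_LEN = 2000  # assumed cap on sequence length; tune for your data and GPU

# truncate the padded batch and clamp the recorded lengths so that
# pack_padded_sequence sees much shorter sequences
batch_padded_sequences = batch_padded_sequences[:, :MAX_LEN, :]
batch_packed_lengths = batch_packed_lengths.clamp(max=MAX_LEN)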

I also got the same problem, which is really, really weird:

 totalloss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.

The error occurs on my RTX 2080 Ti card with CUDA 10.1,

while on my GTX 1080 Ti with CUDA 10.1 everything works fine…

I totally don’t know why.
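
For what it’s worth, the message itself points at one thing to try (a sketch, not a confirmed fix; x stands for whatever tensor feeds the failing layer):

# cuDNN requires contiguous inputs for some ops; .contiguous()
# copies the tensor into contiguous memory only if needed
x = x.contiguous()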