CUDNN_STATUS_INTERNAL_ERROR when loss.backward()

Hi, I’m facing CUDNN_STATUS_INTERNAL_ERROR too, and I’m wondering where the problem comes from.
The full stack trace is listed below:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-3782ca595bc5> in <module>
     72     print('end')
     73 
---> 74 train_model(epochs, train_loader, lstm)

<ipython-input-12-3782ca595bc5> in train_model(epochs, train_loader, model, learning_rate, interval_print)
     64             loss_list.append(loss.item())
     65             optimizer.zero_grad()
---> 66             loss.backward()
     67             torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
     68             optimizer.step()

C:\ProgramData\Anaconda3\envs\pytorch15\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
    196                 products. Defaults to ``False``.
    197         """
--> 198         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    199 
    200     def register_hook(self, hook):

C:\ProgramData\Anaconda3\envs\pytorch15\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     96         retain_graph = create_graph
     97 
---> 98     Variable._execution_engine.run_backward(
     99         tensors, grad_tensors, retain_graph, create_graph,
    100         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR (_cudnn_rnn_backward_input at ..\aten\src\ATen\native\cudnn\RNN.cpp:931)
(no backtrace available)

The model I'm using is an LSTM; the details are listed below:

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.init as init

class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, bidirectional, output_dim, dropout=0.3):
        super(Net, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        # NOTE: the attribute is named self.LSTM, but it is actually an nn.GRU
        self.LSTM = nn.GRU(input_dim, hidden_dim, layer_dim, batch_first=True, bidirectional=bidirectional, dropout=dropout)
        self.out = nn.Linear(hidden_dim * 2, output_dim) if bidirectional else nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # x: (batch, time_step, input_size)
        # LSTM_out: (batch, time_step, output_size)
        # h_n: (n_layers, batch, hidden_size)
        # h_c: (n_layers, batch, hidden_size)
        LSTM_out, h_n = self.LSTM(x, None)
        out = self.out(LSTM_out)
        return out

lstm = Net(
    input_dim=1, 
    hidden_dim=128, 
    layer_dim=5, 
    output_dim=4, 
    bidirectional=True)

The code runs fine when torch.backends.cudnn.enabled = False, yet it fails with this error when torch.backends.cudnn.enabled = True,
and I don't see the error when training a CNN.
I'm using PyTorch 1.5.0, an RTX 2080 Ti, CUDA 10.2, and cuDNN 7.6.5 for CUDA 10.2.
The full environment details (collected with python -m torch.utils.collect_env) are listed below:

Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Microsoft Windows 10 Home
GCC version: (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 8.1.0
CMake version: Could not collect

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 441.22
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2020.1                      216
[conda] mkl-service               2.3.0            py38hb782905_0
[conda] mkl_fft                   1.0.15           py38h14836fe_0
[conda] mkl_random                1.1.0            py38hf9181ef_0
[conda] pytorch                   1.5.0           py3.8_cuda102_cudnn7_0    pytorch
[conda] torchvision               0.6.0                py38_cu102    pytorch

This error has tortured me for a week. I have tried various versions of PyTorch, CUDA, and cuDNN, and I have even tried to copy someone else's working environment, yet nothing works.
Would you mind giving me some advice? Any advice would be great.
Thanks.

I assume you are seeing this error when you enable cuDNN, and the code runs fine without cuDNN?

Could you post a dummy input shape to reproduce this issue, please?

I'm sorry, but I'm not sure what a dummy input shape means. Do you mean the input shape of my data? The input size is [batch_size, timestep, input_size] = [128, 800, 1]. The training code is listed below.

def train_model(epochs, train_loader, model, learning_rate=0.001, interval_print=100):
    
    if torch.cuda.is_available():
        print("the model is in cuda now")
        model = model.cuda()    
    torch.backends.cudnn.enabled = False
    model.train()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        scheduler.step()
        for batch_idx, (wave, label, condition) in enumerate(train_loader):               
            
            wave = wave.view(-1, SIGNAL_LENGTH//INPUT_SIZE, INPUT_SIZE)     # [batch, timestep, input_size] = [128, 800, 1]
            wave = wave.cuda()
            label = label.cuda()
            label = label.view(-1, SIGNAL_LENGTH//INPUT_SIZE, INPUT_SIZE)
            rnn_out = model(wave)
            
            loss = criterion(rnn_out, label)     # try to prevent the loss surface from being too coarse
            loss_list.append(loss.item())
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
            optimizer.step() 
            if batch_idx % interval_print == 0:              
                print("In Epoch: ", epoch+1, ", Batch Idx: ", batch_idx+1, "Training Loss: ", loss.item())
    print('end')    
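
In case a random tensor of that shape is what you mean by a dummy input, something like this would match my data (the values themselves are meaningless):

import torch

# Dummy batch matching my real data: [batch, timestep, input_size] = [128, 800, 1]
wave = torch.randn(128, 800, 1).cuda()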

Thanks

Sorry, I forgot to answer your question. The code runs normally with cuDNN disabled, yet it hits the error when I enable cuDNN.
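
For now I simply disable cuDNN globally before training as a workaround (a minimal sketch of what I'm doing; the flags context manager is a variant I've read about but haven't verified on my setup):

import torch

# Temporary workaround: fall back to the native (non-cuDNN) RNN kernels
torch.backends.cudnn.enabled = False

# Alternatively, cuDNN could be disabled only around a specific block:
# with torch.backends.cudnn.flags(enabled=False):
#     out = model(x)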

I cannot reproduce this error using your provided code snippet and shapes on an RTX 2080 Ti (and a V100) using PyTorch 1.5.0, CUDA 10.2.89, and cuDNN 7.6.5.32:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, bidirectional, output_dim, dropout=0.3):
        super(Net, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.LSTM = nn.GRU(input_dim, hidden_dim, layer_dim, batch_first=True, bidirectional=bidirectional, dropout=dropout)
        self.out = nn.Linear(hidden_dim * 2, output_dim) if bidirectional else nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: (batch, time_step, input_size)
        # LSTM_out: (batch, time_step, output_size)
        # h_n: (n_layers, batch, hidden_size)
        # h_c: (n_layers, batch, hidden_size)
        LSTM_out, h_n = self.LSTM(x, None)
        out = self.out(LSTM_out)
        return out

model = Net(
    input_dim=1,
    hidden_dim=128,
    layer_dim=5,
    output_dim=4,
    bidirectional=True).cuda()


x = torch.randn(128, 800, 1).cuda()
criterion = nn.MSELoss()

out = model(x)
loss = criterion(out, torch.rand_like(out))
loss.backward()

print('done')

Could you check, if this code runs on your machine?

Sorry, I forgot to mention that the error appears only after a few epochs. I will give you a toy example soon.

Now I know when the problem occurs, and I have some guesses about the cause.
Let me describe my setup. Normally, I like to plot the model's output together with the label to check whether the model's behavior is reasonable.
The catch is that I do this plotting before computing the loss.
Here's the code that triggers the problem.

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

epochs = 600
clipping_value = 1

model = Net(
    input_dim=1,
    hidden_dim=128,
    layer_dim=5,
    output_dim=1,
    bidirectional=True).cuda()

# torch.cuda.set_device(0)
torch.backends.cudnn.benchmark = True
def get_learning_rate(optimizer):
    lr = []
    for param_group in optimizer.param_groups:
        lr += [param_group['lr']]
    return lr

def train_model(epochs, model, learning_rate=0.001, interval_print=100):
    
    if torch.cuda.is_available():
        model = model.cuda()    
    torch.backends.cudnn.enabled = True
    model.train()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
    criterion = nn.MSELoss()
    x = torch.randn(128, 800, 1).cuda()
    y = torch.randn(128, 800, 1).cuda()
    
    for epoch in range(epochs):
        print("the current epoch is: ", epoch)
        scheduler.step()               
        rnn_out = model(x)
        plt.plot(y.cpu().detach().numpy()[-1, :], 'b-', label= 'label')
        plt.plot(rnn_out.cpu().detach().numpy()[-1, :], 'go', label='output')
        plt.legend(loc='best')
        plt.show()
        plt.plot(x.cpu().detach().numpy()[-1, :, 0], 'g-', label='input')
        plt.show() 
        
        
        loss = criterion(rnn_out, y)     # try to prevent the loss surface from being too coarse
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
        optimizer.step() 
            
train_model(epochs, model)        

After a few minutes, the cuDNN error appears:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-e410bf903d5b> in <module>
     52 
     53 
---> 54 train_model(epochs, model)

<ipython-input-4-e410bf903d5b> in train_model(epochs, model, learning_rate, interval_print)
     44         loss = criterion( rnn_out ,y)     # try to prevent the loss suface from being too coarse
     45         optimizer.zero_grad()
---> 46         loss.backward()
     47         torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
     48         optimizer.step()

C:\ProgramData\Anaconda3\envs\pytorch15\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
    196                 products. Defaults to ``False``.
    197         """
--> 198         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    199 
    200     def register_hook(self, hook):

C:\ProgramData\Anaconda3\envs\pytorch15\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     96         retain_graph = create_graph
     97 
---> 98     Variable._execution_engine.run_backward(
     99         tensors, grad_tensors, retain_graph, create_graph,
    100         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR (_cudnn_rnn_backward_input at ..\aten\src\ATen\native\cudnn\RNN.cpp:931)
(no backtrace available)

If I skip the plotting part, or plot only after computing the loss and calling loss.backward(), the code runs normally (see the sketch below).
I suspect the problem occurs because the input, the model's output, and the label are moved to the CPU during plotting, and the error then somehow appears when computing the loss with loss = criterion(rnn_out, y) and calling loss.backward().
I only know when the problem appears, but I still don't know why it appears. Moreover, this error didn't occur before I updated a few packages in Anaconda (unfortunately I didn't preserve the details of the previous environment).
So I wonder if you can give some advice about why such an error occurs. Any advice will help.
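
For reference, this is roughly the reordering that runs fine for me (only the loop body is shown; the plotting is moved after the optimizer step and done on detached CPU copies):

for epoch in range(epochs):
    scheduler.step()
    rnn_out = model(x)

    loss = criterion(rnn_out, y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
    optimizer.step()

    # Plotting only after backward()/step() avoids the cuDNN error for me
    plt.plot(y.detach().cpu().numpy()[-1, :], 'b-', label='label')
    plt.plot(rnn_out.detach().cpu().numpy()[-1, :], 'go', label='output')
    plt.legend(loc='best')
    plt.show()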
Thanks.

I'm able to run the code for 600 epochs.
Since the code was running fine before updating some packages, could you create a new virtual environment and install the latest stable PyTorch version there?
The addition of the plotting calls shouldn't change anything, so it would be interesting to see which package update might have triggered this issue.
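
To help narrow it down, you could print the versions that are most likely relevant in both the old and the new environment (a quick sketch; adapt it as needed):

import numpy as np
import torch

# Print the versions most likely relevant for this issue
print("torch:", torch.__version__)
print("CUDA used to build PyTorch:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("numpy:", np.__version__)
print("GPU:", torch.cuda.get_device_name(0))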

I assume you are not running out of memory, right?
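
You could e.g. print the device memory stats right before the failing iteration to rule that out (just a rough sketch):

import torch

def print_memory_stats(tag=""):
    # Rough overview of the allocated / reserved device memory in MB
    mb = 1024 ** 2
    print(f"[{tag}] allocated: {torch.cuda.memory_allocated() / mb:.1f} MB, "
          f"reserved: {torch.cuda.memory_reserved() / mb:.1f} MB, "
          f"max allocated: {torch.cuda.max_memory_allocated() / mb:.1f} MB")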

May I ask which OS you are using? Maybe this error only occurs on Windows machines, since another computer in our lab with a 2080 Ti and Windows 10 also hits this error.
Yes, there is no out-of-memory issue.
My current environment already has the latest PyTorch version installed.
The details are listed below.

Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Microsoft Windows 10 Home
GCC version: (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 8.1.0
CMake version: Could not collect

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 441.22
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2020.1                      216
[conda] mkl-service               2.3.0            py38hb782905_0
[conda] mkl_fft                   1.0.15           py38h14836fe_0
[conda] mkl_random                1.1.0            py38hf9181ef_0
[conda] pytorch                   1.5.0           py3.8_cuda102_cudnn7_0    pytorch
[conda] torchvision               0.6.0                py38_cu102    pytorch

The previous environment that didn't have this issue is pretty old; I can only remember that it used CUDA 9.2 and PyTorch 1.2.0. When I tried to recreate that environment to see whether the error disappears, I couldn't get it working again, so I can't tell you which of my old environments was free of this error. Sorry.
Thanks.

I’m using Ubuntu 16.04 on all machines. I could try to reproduce this issue on a Windows system with a Titan V, but unfortunately I don’t have an RTX2080Ti built into a Windows machine.

I couldn't appreciate it more. If you find anything new or have any questions, please let me know.
By the way, could you share the code you're using to try to reproduce the issue? Maybe there's something wrong in my code that I haven't noticed.
Many thanks to you.

I'm sorry, you're right. Whether or not the plotting part is there doesn't affect the result; the cuDNN error still occurs. I think I was just lucky that the code ran normally two days ago. Now the error still exists.