Hi, I’m hitting CUDNN_STATUS_INTERNAL_ERROR too, and I’m trying to figure out where the problem comes from.
The full stack trace is listed below:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-3782ca595bc5> in <module>
72 print('end')
73
---> 74 train_model(epochs, train_loader, lstm)
<ipython-input-12-3782ca595bc5> in train_model(epochs, train_loader, model, learning_rate, interval_print)
64 loss_list.append(loss.item())
65 optimizer.zero_grad()
---> 66 loss.backward()
67 torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
68 optimizer.step()
C:\ProgramData\Anaconda3\envs\pytorch15\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
196 products. Defaults to ``False``.
197 """
--> 198 torch.autograd.backward(self, gradient, retain_graph, create_graph)
199
200 def register_hook(self, hook):
C:\ProgramData\Anaconda3\envs\pytorch15\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
96 retain_graph = create_graph
97
---> 98 Variable._execution_engine.run_backward(
99 tensors, grad_tensors, retain_graph, create_graph,
100 allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR (_cudnn_rnn_backward_input at ..\aten\src\ATen\native\cudnn\RNN.cpp:931)
(no backtrace available)
The model I’m using is a recurrent network; its definition is listed below (note that the attribute is named LSTM, but the layer itself is an nn.GRU):
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, bidirectional, output_dim, dropout=0.3):
        super(Net, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        # Despite the attribute name, this is an nn.GRU, not an nn.LSTM.
        self.LSTM = nn.GRU(input_dim, hidden_dim, layer_dim, batch_first=True,
                           bidirectional=bidirectional, dropout=dropout)
        self.out = nn.Linear(hidden_dim * 2, output_dim) if bidirectional else nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x:        (batch, time_step, input_dim)
        # LSTM_out: (batch, time_step, hidden_dim * 2) since bidirectional=True
        # h_n:      (layer_dim * 2, batch, hidden_dim) since bidirectional=True
        LSTM_out, h_n = self.LSTM(x, None)
        out = self.out(LSTM_out)
        return out

lstm = Net(
    input_dim=1,
    hidden_dim=128,
    layer_dim=5,
    output_dim=4,
    bidirectional=True,
)
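For completeness, here is a minimal sketch of my training loop, reconstructed from the traceback above; the optimizer, loss function, data shapes, learning_rate, and clipping_value shown here are placeholders standing in for my real ones:

import torch
import torch.nn as nn

def train_model(epochs, train_loader, model, learning_rate=1e-3, interval_print=10):
    # Placeholder optimizer/criterion; my real ones fail the same way.
    model = model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.MSELoss()
    clipping_value = 1.0
    loss_list = []
    for epoch in range(epochs):
        for step, (x, y) in enumerate(train_loader):
            x, y = x.cuda(), y.cuda()
            out = model(x)              # (batch, time_step, output_dim)
            loss = criterion(out, y)
            loss_list.append(loss.item())
            optimizer.zero_grad()
            loss.backward()             # <-- CUDNN_STATUS_INTERNAL_ERROR is raised here
            torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
            optimizer.step()
            if step % interval_print == 0:
                print(f'epoch {epoch} step {step} loss {loss.item():.4f}')
    print('end')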
The code runs fine when torch.backends.cudnn.enabled = False, yet it hits the error above when torch.backends.cudnn.enabled = True, and I don’t hit the error at all when training a CNN.
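For clarity, this is the flag I’m toggling; it has to be set at the top of the script, before the model is created (a minimal sketch, not my full script):

import torch

# Workaround: disabling cuDNN avoids the crash, but PyTorch then falls
# back to the slower native RNN kernels.
torch.backends.cudnn.enabled = False

# With the default setting (True), loss.backward() raises
# CUDNN_STATUS_INTERNAL_ERROR inside _cudnn_rnn_backward_input.
# torch.backends.cudnn.enabled = True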
I’m using PyTorch 1.5.0, an RTX 2080 Ti, CUDA 10.2, and cuDNN 7.6.5 for CUDA 10.2.
The full output of collect_env is listed below:
Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Microsoft Windows 10 Home
GCC version: (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 8.1.0
CMake version: Could not collect
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce RTX 2080 Ti
Nvidia driver version: 441.22
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0
[conda] blas 1.0 mkl
[conda] mkl 2020.1 216
[conda] mkl-service 2.3.0 py38hb782905_0
[conda] mkl_fft 1.0.15 py38h14836fe_0
[conda] mkl_random 1.1.0 py38hf9181ef_0
[conda] pytorch 1.5.0 py3.8_cuda102_cudnn7_0 pytorch
[conda] torchvision 0.6.0 py38_cu102 pytorch
This error has tortured me for a week. I have tried various versions of PyTorch, CUDA, and cuDNN, and I even tried to replicate someone else’s known-working environment, yet nothing works.
Would you mind giving some advice? Any pointers would be greatly appreciated.
Thanks.