Google Colab RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

carlo_david · February 3, 2020, 1:19am

I am using Google Colab to train a bidirectional RNN model and I get this error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-34-0029e71ae99b> in <module>()
     20             inputs, labels = inputs.to(device), labels.to(device)
     21 
---> 22             output = model(inputs)
     23             loss = criterion(output.squeeze(), labels.float())
     24             optimizer.zero_grad()

5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in forward_impl(self, input, hx, batch_sizes, max_batch_size, sorted_indices)
    524         if batch_sizes is None:
    525             result = _VF.lstm(input, hx, self._get_flat_weights(), self.bias, self.num_layers,
--> 526                               self.dropout, self.training, self.bidirectional, self.batch_first)
    527         else:
    528             result = _VF.lstm(input, batch_sizes, hx, self._get_flat_weights(), self.bias,

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I tried this solution and this one, and I still get the error.

Here’s my BiRnn Model code:

class BiRNN(nn.Module):
    def __init__(self, n_vocab, n_embed, hidden_size, seq_len, num_layers, output_size, drop_prob):
        super(BiRNN, self).__init__()
        self.hidden_size = hidden_size
        self.seq_len = seq_len
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(n_vocab, n_embed)
        self.lstm = nn.LSTM(n_embed, hidden_size, num_layers, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_size*2, output_size)

    def forward(self, x):
        # Set initial states on the same device as the input
        h0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers*2, x.size(0), self.hidden_size, device=x.device)
        # The embedding output is already on the module's device
        x = self.embedding(x)
        # Forward propagate LSTM
        lstm_out, _ = self.lstm(x, (h0, c0))  
        lstm_out = lstm_out.contiguous().view(-1, self.seq_len, 2, self.hidden_size)
        # get backward output in first node
        lstm_out_bw = lstm_out[:, 0, 1, :]
        # get forward output in last node
        lstm_out_fw = lstm_out[:, -1, 0, :]
        lstm_out = torch.cat((lstm_out_fw, lstm_out_bw), -1)
        drop_out = self.dropout(lstm_out)
        logits = self.fc(drop_out)

        return logits

ptrblck · February 3, 2020, 5:12am

Which PyTorch, CUDA, and cudnn version are you using?
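
These can be printed from a notebook cell. A small sketch using standard PyTorch attributes (the `versions` dictionary is only for illustration):

```python
import torch

# Collect the version information asked about above.
versions = {
    "torch": torch.__version__,
    "cuda": torch.version.cuda,               # None on CPU-only builds
    "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    "cuda_available": torch.cuda.is_available(),
}

for name, value in versions.items():
    print(f"{name}: {value}")
```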

carlo_david · February 3, 2020, 5:17am

I'm using Google Colab, which has PyTorch 1.3 and CUDA 10.1 by default.

ptrblck · February 4, 2020, 2:56am

Issue tracked here.

carlo_david · February 4, 2020, 3:52am

It's working now. I trained it on the CPU for a few epochs, and after some steps it showed a different error: an embedding index out of range error. I then resolved that error. I wonder why it doesn't show this error when training on CUDA.
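
For anyone hitting the same thing, here is a minimal sketch of a check that catches this before the forward pass, assuming the cause is a token index outside the embedding table. The helper name `check_embedding_indices` and the sizes below are hypothetical, not taken from my training code. On the CPU, `nn.Embedding` raises a readable index error for such an input, while on CUDA the failure tends to be reported asynchronously and can surface later as a less specific error.

```python
import torch
import torch.nn as nn


def check_embedding_indices(inputs, n_vocab):
    """Raise ValueError if any index falls outside [0, n_vocab)."""
    lo, hi = int(inputs.min()), int(inputs.max())
    if lo < 0 or hi >= n_vocab:
        raise ValueError(
            f"index range [{lo}, {hi}] is outside the embedding table "
            f"[0, {n_vocab - 1}]"
        )


n_vocab = 10  # hypothetical vocabulary size
embedding = nn.Embedding(n_vocab, 4)

good = torch.tensor([[1, 2, 9]])
bad = torch.tensor([[1, 2, 10]])  # 10 is out of range when n_vocab is 10

check_embedding_indices(good, n_vocab)  # passes silently
out = embedding(good)                   # shape (1, 3, 4)

try:
    check_embedding_indices(bad, n_vocab)
    message = None
except ValueError as err:
    message = str(err)
    print(message)
```

Calling a check like this on each batch before `inputs.to(device)` gives the same kind of readable error that the CPU run produced.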

Lornatang · February 4, 2020, 4:09am

I ran into this problem and fixed it. You just need to restart Google Colab.