cuDNN error with simple code

Yoohv_Zo · March 12, 2019, 6:27am

My information:

system: ubuntu:18.04
python: 3.7.1
torch: 1.0.1.post2

Code:

import torch
import torch.nn as nn


class RNN_ENCODER(nn.Module):
    def __init__(self, num_embeddings, input_size=300, drop_prob=0.5,
                 hidden_size=256, layers=1, bidirectional=True):
        super().__init__()
        self.encoder = nn.Embedding(num_embeddings, input_size)
        self.drop = nn.Dropout(drop_prob)
        if bidirectional:
            hidden_size //= 2
        drop_prob = 0 if layers == 1 else drop_prob
        self.rnn = nn.LSTM(input_size, hidden_size, layers, batch_first=True,
                           dropout=drop_prob, bidirectional=bidirectional)
        # self.encoder.weight.data.uniform_(-0.1, 0.1)

    def forward(self, captions, hidden):
        emb = self.drop(self.encoder(captions))
        output, _ = self.rnn(emb, hidden)
        return output


if __name__ == '__main__':
    rnn = RNN_ENCODER(num_embeddings=100).cuda()

Error:

Traceback (most recent call last):
  File "src/Models/TextEncoder.py", line 25, in <module>
    rnn = RNN_ENCODER(num_embeddings=100).cuda()
  File "/home/zyoohv/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 260, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/zyoohv/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/zyoohv/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "/home/zyoohv/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

ptrblck · March 12, 2019, 1:12pm

Could you try to disable cuDNN using torch.backends.cudnn.enabled = False and run your code again?
The code you’ve posted here runs fine on my machine, so the error might come from some setup issue on your machine or the way you are using your model.
Here is a small code snippet which is working:

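import torch

# Reuses `rnn` from the script above (bidirectional, 1 layer, hidden size 256 // 2 = 128).
x = torch.randint(0, 100, (1, 1)).to('cuda')  # token indices, shape (batch, seq_len)
hidden = torch.randn(2, 1, 128).to('cuda')    # h_0: (layers * directions, batch, hidden)
state = torch.randn(2, 1, 128).to('cuda')     # c_0: same shape as h_0
output = rnn(x, (hidden, state))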
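
The cuDNN toggle suggested above can be tried in isolation. The following is only a minimal sketch of that diagnostic: it uses a bare nn.LSTM with the same sizes instead of the RNN_ENCODER class, and the CPU fallback is an assumption added so the block runs on machines without a GPU. Disabling cuDNN makes PyTorch use its native LSTM path, which is typically slower, so it helps narrow down the problem rather than fix it.

```python
import torch
import torch.nn as nn

# Assumption: fall back to CPU when no GPU is visible, so the sketch still runs.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

previous = torch.backends.cudnn.enabled
torch.backends.cudnn.enabled = False  # diagnostic only, not a real fix
try:
    # Same sizes as the posted model: input 300, hidden 256 // 2 = 128, bidirectional.
    lstm = nn.LSTM(300, 128, 1, batch_first=True, bidirectional=True).to(device)
    emb = torch.randn(1, 1, 300, device=device)     # (batch, seq_len, input_size)
    hidden = torch.randn(2, 1, 128, device=device)  # (layers * directions, batch, hidden)
    state = torch.randn(2, 1, 128, device=device)
    output, _ = lstm(emb, (hidden, state))
finally:
    # Restore the original setting so later code is unaffected.
    torch.backends.cudnn.enabled = previous

print(output.shape)
```

If this runs with cuDNN disabled but fails with it enabled, the problem is more likely in the CUDA/cuDNN/driver setup than in the model code.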

Yoohv_Zo · March 12, 2019, 3:07pm

Thanks a lot!
It works for me with torch.backends.cudnn.enabled = False.

ptrblck · March 12, 2019, 3:13pm

Good to hear, but that’s not really a solution.
Could you post which PyTorch, CUDA and cuDNN versions you are using:

print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())

so that I could try to reproduce this issue?

Yoohv_Zo · March 16, 2019, 3:53pm

The output of the code is:

1.0.1.post2
9.0.176
7402

I would also really like to know the cause of the error.
Thank you.
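
For readers comparing versions: the integer 7402 reported above can be split into a dotted version. This is a small sketch that assumes the MAJOR * 1000 + MINOR * 100 + PATCH encoding used by cuDNN 7.x; the helper name decode_cudnn_version is hypothetical and not part of PyTorch.

```python
def decode_cudnn_version(value):
    """Split an integer such as 7402 into (major, minor, patch).

    Assumes the MAJOR * 1000 + MINOR * 100 + PATCH encoding of cuDNN 7.x;
    other major versions may encode the number differently.
    """
    major, rest = divmod(value, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(decode_cudnn_version(7402))  # (7, 4, 2)
```

Under that assumption, the reported setup is PyTorch 1.0.1.post2 with CUDA 9.0.176 and cuDNN 7.4.2.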

ptrblck · March 17, 2019, 1:19am

Thanks for the information.
I just created a conda environment with the same versions and couldn’t reproduce the error.
Which GPU are you using? The code runs fine on a GTX1080Ti.
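
One way to answer the GPU question from Python is sketched below. The helper name describe_gpus is hypothetical; it relies only on torch.cuda.is_available, torch.cuda.device_count, torch.cuda.get_device_name and torch.cuda.get_device_capability, and returns an empty list when no CUDA device is visible.

```python
import torch


def describe_gpus():
    """Return one summary string per visible CUDA device (empty list if none)."""
    if not torch.cuda.is_available():
        return []
    lines = []
    for idx in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(idx)
        major, minor = torch.cuda.get_device_capability(idx)
        lines.append(f"cuda:{idx} {name} (compute capability {major}.{minor})")
    return lines


for line in describe_gpus() or ["no CUDA device visible"]:
    print(line)
```

Posting this output together with the three version numbers gives enough detail to attempt a reproduction on matching hardware.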