Fresh PyTorch install, checking if CUDA works, gets RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED


(Nanyi Jiang) #1

OS: Ubuntu 18.04
GPU: RTX 2080 Ti with Driver @ 418.43

environment:
fresh install of pytorch==1.0.1.post2

code to reproduce:

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

lstm = nn.LSTM(3, 3).cuda()  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3).cuda(),
          torch.randn(1, 1, 3).cuda())
for i in inputs:
    # Step through the sequence one element at a time.
    # After each step, hidden contains the hidden state.
    # (input and hidden must be on the GPU, since the LSTM is.)
    out, hidden = lstm(i.view(1, 1, -1).cuda(), hidden)

# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3).cuda(),
          torch.randn(1, 1, 3).cuda())  # clean out hidden state (tuples have no .cuda(); move each tensor)
out, hidden = lstm(inputs.cuda(), hidden)
print(out)
print(hidden)

The script runs fine without the .cuda() calls.

outputs:

I! CuDNN (v7402) function cudnnCreate() called:
i! Time: 2019-04-15T12:31:28.103299 (0d+0h+0m+3s since start)
i! Process=28964; Thread=28964; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnCreateDropoutDescriptor() called:
i! Time: 2019-04-15T12:31:28.104148 (0d+0h+0m+3s since start)
i! Process=28964; Thread=28964; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v7402) function cudnnSetDropoutDescriptor() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     dropout: type=float; val=0.000000;
i!     states: location=dev; addr=NULL_PTR;
i!     stateSizeInBytes: type=size_t; val=0;
i!     seed: type=unsigned long long; val=0;
i! Time: 2019-04-15T12:31:28.104179 (0d+0h+0m+3s since start)
i! Process=28964; Thread=28964; GPU=0; Handle=0x8864f570; StreamId=(nil) (defaultStream).


I! CuDNN (v7402) function cudnnDestroyDropoutDescriptor() called:
i! Time: 2019-04-15T12:31:28.105787 (0d+0h+0m+3s since start)
i! Process=28964; Thread=28964; GPU=NULL; Handle=NULL; StreamId=NULL.

Traceback (most recent call last):
  File "rnn.py", line 10, in <module>
    lstm = nn.LSTM(3, 3).cuda()  # Input dim is 3, output dim is 3
  File "xxxxx/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 260, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "xxxxx/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "xxxxx/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I! CuDNN (v7402) function cudnnDestroy() called:
i! Time: 2019-04-15T12:31:28.129815 (0d+0h+0m+3s since start)
i! Process=28964; Thread=28964; GPU=NULL; Handle=NULL; StreamId=NULL.


#2

Did you build PyTorch from source?
Which CUDA version are you using?
Based on other threads, it seems you should use CUDA 10 for your RTX card (Turing GPUs need CUDA 10).
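To answer that, you can query the versions baked into the installed PyTorch wheel from Python. This is a quick diagnostic sketch using standard PyTorch attributes; on a CPU-only build `torch.version.cuda` and the cuDNN version may be `None`:

```python
import torch

# Versions the installed PyTorch wheel was built against
print("PyTorch:", torch.__version__)
print("CUDA (compiled against):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# Whether a GPU is visible at runtime
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```

If this prints a CUDA version of 9.x on an RTX 2080 Ti, that mismatch is the likely culprit.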


(Nanyi Jiang) #3

Coming back with an update:

We wiped our Ubuntu installation and reinstalled the driver and AllenNLP (which pulls in PyTorch as a dependency); that didn't work. That's when we suspected something was off with our PyTorch installation. It turns out the PyTorch build installed via AllenNLP targets CUDA 9.0; once we uninstalled PyTorch and reinstalled a build for CUDA 10.0, everything worked.

I plan to report this to AllenNLP so they can take a look, but that is how we resolved the issue.
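For anyone hitting the same thing, the fix above amounts to swapping the CUDA 9.0 wheel for a CUDA 10.0 one. The wheel URL below follows the cu100 naming PyTorch used at the time, but treat it as an assumption and check the PyTorch previous-versions install page for your Python version:

```shell
# Remove the CUDA 9.0 build that came in as a dependency
pip uninstall -y torch

# Install a PyTorch 1.0.1 build compiled against CUDA 10.0
# (cu100 wheel index; adjust cp36 to match your Python version)
pip install https://download.pytorch.org/whl/cu100/torch-1.0.1.post2-cp36-cp36m-linux_x86_64.whl
```

After reinstalling, `python -c "import torch; print(torch.version.cuda)"` should report 10.0.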