cuDNN error using LSTM

MohammedAljahdali · February 22, 2021, 1:05pm

Using the following model:

import torch
import torch.nn as nn

class Model(nn.Module):

    def __init__(self, n_classes, num_layers, hidden_size):
        super(Model, self).__init__()
        self.blstm = nn.LSTM(input_size=96, hidden_size=hidden_size, num_layers=num_layers, dropout=0, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, n_classes)
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, x):
        # x of shape bs, 1, h, w
        x = x.squeeze(1)
        # x of shape bs, h, w
        x = x.permute(2, 0, 1)
        # x of shape w, bs, h
        x, _ = self.blstm(x, None)
        # x of shape w, bs, hidden_size * 2 (bidirectional)
        x = self.fc(x)
        # x of shape w, bs, n_classes
        x = self.softmax(x)
        return x

I get the following error:

    x, _ = self.blstm(x, None)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\nn\modules\rnn.py", line 581, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

This error happens only on the GPU. When I run my model on the CPU, everything works fine.

ptrblck · February 23, 2021, 6:00am

Could you post the input shapes you are using, as well as the PyTorch, CUDA, and cuDNN versions and the GPU you are using, so that we can reproduce the issue?

MohammedAljahdali · February 23, 2021, 4:42pm

Pytorch=1.7.1
py=3.8
cuda=102
cudnn=7

As for the input shape, it’s (batch_size, channels, fixed_height, width).
Examples would be (16, 1, 96, 620) or (8, 1, 96, 702).
So essentially it’s a grayscale image with a fixed height and a variable width.
Thank you
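
To make the shape flow concrete, here is a minimal CPU-only sketch that pushes a dummy batch of the second example shape through the same sequence of layers as the model above. The hidden size and class count are arbitrary placeholders chosen for illustration, not the values used in training.

```python
import torch
import torch.nn as nn

# Placeholder sizes, chosen for illustration only.
hidden_size, n_classes = 8, 5

blstm = nn.LSTM(input_size=96, hidden_size=hidden_size, num_layers=1, bidirectional=True)
fc = nn.Linear(hidden_size * 2, n_classes)

x = torch.randn(8, 1, 96, 702)   # (bs, channels, h, w)
x = x.squeeze(1)                 # (bs, h, w)
x = x.permute(2, 0, 1)           # (w, bs, h): the width becomes the sequence axis
seq_out, _ = blstm(x)            # (w, bs, 2 * hidden_size) because the LSTM is bidirectional
logits = fc(seq_out)             # (w, bs, n_classes)
log_probs = torch.log_softmax(logits, dim=2)

print(seq_out.shape, log_probs.shape)
```

Since the width is the sequence axis, each batch can have a different sequence length, while the LSTM's input_size stays fixed at the image height of 96.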

ptrblck · February 23, 2021, 7:39pm

Thanks! Could you also post which GPU you are using? I’m unable to reproduce this error using:

model = Model(2, 1, 16).cuda()
x = torch.randn(16, 1, 96, 620).cuda()
out = model(x)

on my machine.

MohammedAljahdali · February 23, 2021, 7:42pm

Sorry, I forgot to do so. My GPU is an RTX 2080. As for the specific model size, I have 2 LSTM layers with 200 hidden units each, and 51 classes. Moreover, the error does not occur right away; it occurs early on in the training loop. I will try to check the size and content of the batch that triggers it, to see if it is related.
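
One way to check the size and content of a batch is to record a few facts about it before the forward pass, so that a failing batch can be compared with the ones that succeed. The sketch below is only an illustration: `describe_batch` is a hypothetical helper, not part of PyTorch, and the NaN injected into the second tensor is just there to show what a problematic batch would look like.

```python
import torch

def describe_batch(img):
    """Return basic facts about a batch so a failing one can be compared with good ones."""
    cpu_img = img.detach().cpu()  # copy to host before inspecting the values
    return {
        "shape": tuple(cpu_img.shape),
        "dtype": str(cpu_img.dtype),
        "contiguous": img.is_contiguous(),
        "all_finite": bool(torch.isfinite(cpu_img).all()),
        "min": float(cpu_img.min()),
        "max": float(cpu_img.max()),
    }

good = torch.rand(8, 1, 96, 609)
bad = good.clone()
bad[0, 0, 0, 0] = float("nan")  # simulate a corrupted pixel

print(describe_batch(good))
print(describe_batch(bad))
```

The helper should be called before the model's forward pass. Once a CUDA error has been raised, copying the tensor back to the host is likely to fail as well.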

MohammedAljahdali · February 23, 2021, 8:07pm

This is a screenshot from the debugger at the point where the error occurs.

The size of the image batch when this error happens is: (8, 1, 96, 609)
In the except block, another error occurred when print(img) was called:

    print(img)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\tensor.py", line 179, in __repr__
    return torch._tensor_str._str(self)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 372, in _str
    return _str_intern(self)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 352, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 241, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 273, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 273, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 275, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in self])
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 275, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in self])
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 273, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 273, in <listcomp>
    return torch.stack([get_summarized_data(x) for x in (start + end)])
  File "C:\Users\Mohammed\.conda\envs\dl_env\lib\site-packages\torch\_tensor_str.py", line 266, in get_summarized_data
    return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
RuntimeError: CUDA error: unspecified launch failure
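
The failure inside print(img) is probably not a separate problem. CUDA kernels run asynchronously, so an error is often reported at a later call than the one that caused it, and after a launch failure, subsequent CUDA operations in the same process (including the ones needed to format a GPU tensor for printing) tend to fail too. A common way to get a stack trace that points closer to the failing operation is to set the CUDA_LAUNCH_BLOCKING environment variable before CUDA is initialized. A minimal sketch, assuming the variable is set at the very top of the training script:

```python
import os

# Must be set before the first CUDA call; setting it before importing torch is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and run the training code only after this point

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Setting the variable in the shell before launching the script has the same effect. Synchronous launches slow training down, so this is meant for debugging runs only.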

ptrblck · February 23, 2021, 9:46pm

Thanks! I’ll try to reproduce it on the mentioned device and the same setup.

MohammedAljahdali · March 5, 2021, 12:23pm

Mr ptrblck, are there any updates?

This issue also occurred on other models that use an LSTM, but only once each; whenever I re-run the same script, the issue does not occur again on those models.

As for the original model, nothing has changed: it still fails every time, early in the first epoch.
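
A possible way to narrow this down, offered only as a diagnostic sketch and not as a confirmed fix, is to run the LSTM with cuDNN disabled so that PyTorch falls back to its native CUDA kernels. If the failure disappears, that points at the cuDNN code path; if it persists, the cause is more likely elsewhere (for example the driver or the hardware). The layer sizes below follow the numbers given earlier in the thread; the dummy input and variable names are illustrative.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# 2 layers, 200 hidden units, bidirectional, as described earlier in the thread.
lstm = nn.LSTM(input_size=96, hidden_size=200, num_layers=2, bidirectional=True).to(device)
x = torch.randn(609, 8, 96, device=device)  # (w, bs, h), matching the failing batch size

# Temporarily disable cuDNN for everything executed inside this block.
with torch.backends.cudnn.flags(enabled=False):
    inside = torch.backends.cudnn.enabled
    out, _ = lstm(x)

print(inside, out.shape)
```

The non-cuDNN path is usually slower, so this is intended for isolating the problem, not for regular training.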