Error detected in CudnnRnnBackward

The following code fails on the GPU but runs without problems on the CPU. It also runs fine on the GPU when batch_size is reduced to 256, so I wonder whether this is a bug in torch autograd or in CUDA. Could the PyTorch team take a look?

This is the error message when running on the GPU with batch_size = 512:

NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 11.1
PyTorch version: 1.9.0

[W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnRnnBackward. Traceback of forward call that caused the error:
  File "rnn_error.py", line 41, in <module>
    logits = model(input)
  File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "rnn_error.py", line 20, in forward
    lstm_out, _ = self.lstm(x)
  File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 680, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
 (function _print_stack)
Traceback (most recent call last):
  File "rnn_error.py", line 49, in <module>
    loss.backward()
  File "/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Here is the script (rnn_error.py) that reproduces the error:

import torch
import torch.nn as nn
import torch.nn.functional as F

# report the forward-pass origin of errors raised in backward
torch.autograd.set_detect_anomaly(True)

class Net(nn.Module):
    def __init__(self, input_dim=80, hidden_dim=512, n_layers=3, embedding_dim=256, target_dim=300000):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim    
        self.lstm = nn.LSTM(input_dim, hidden_dim, n_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, embedding_dim)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(embedding_dim, target_dim)
        

    def forward(self, x):
        """Forward function."""
        # hidden state set to zeros by default
        lstm_out, _ = self.lstm(x)
        # take the LSTM output at the last time step for the linear layers
        last_out = lstm_out[:, -1]
        
        emb = self.fc1(last_out)
        emb_d = self.dropout(emb)
        logits = self.fc2(emb_d)
        
        return logits
    
    
device = 'cuda:0'
batch_size = 512

# input shape: (batch, seq_len, features) = (512, 800, 80); batch_size = 256 works
input = torch.randn(batch_size, 800, 80, requires_grad=True).to(device)
target = torch.randint(300000, (batch_size,), dtype=torch.int64).to(device)


model = Net().to(device)
print(model)

logits = model(input)
print('logits.shape = {}'.format(logits.shape))
print('target.shape = {}'.format(target.shape))

loss = F.cross_entropy(logits, target)
print('loss = {:.2f}'.format(loss.item()))

loss.backward()
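
In case it helps others hitting this before a fix lands: two possible stopgaps, sketched under the assumption that the failure is confined to cuDNN's RNN kernels (neither is verified in this thread).

First, disable cuDNN so the LSTM falls back to PyTorch's slower native implementation:

# assumption: only the cuDNN RNN kernel fails, so the native path is safe
torch.backends.cudnn.enabled = False  # set before the forward pass
logits = model(input)
loss = F.cross_entropy(logits, target)
loss.backward()  # no longer dispatches to CudnnRnnBackward

Second, since batch_size = 256 runs fine, split each batch in two and accumulate gradients, which preserves the effective batch size:

# process the 512-sample batch as two chunks of 256; each chunk loss is a
# mean over 256 samples, so weighting each by 1/2 matches the full-batch mean
model.zero_grad()
for chunk_in, chunk_tgt in zip(input.chunk(2), target.chunk(2)):
    (F.cross_entropy(model(chunk_in), chunk_tgt) / 2).backward()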

Could you share the output of python -m torch.utils.collect_env? I'm unable to reproduce the issue using your code and a current source build.

Collecting environment information…
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10

Python version: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.19.56-1.el7.x86_64-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 418.40.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.10.0
[pip3] torch-summary==1.4.5
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h2bc3f7f_2
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py37h7f8727e_0
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h51133e4_0
[conda] numpy 1.21.3 pypi_0 pypi
[conda] numpy-base 1.21.2 py37h79a1101_0
[conda] pytorch 1.10.0 py3.7_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch-summary 1.4.5 pypi_0 pypi
[conda] torchaudio 0.10.0 py37_cu113 pytorch
[conda] torchvision 0.11.1 py37_cu113 pytorch

>>> import torch
>>> torch.backends.cudnn.version()
8200
>>> torch.version.cuda
'11.3'
>>> torch.__version__
'1.10.0'

Thanks for the update!
I was able to reproduce the issue using PyTorch 1.11.0+cu113. However, it seems the error has already been solved in cuDNN, as the 1.11.0+cu115 pip wheel works fine (as does a master build with a newer cuDNN release).
You can update via:

pip install torch==1.11.0+cu115 -f https://download.pytorch.org/whl/cu115/torch_stable.html
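
After updating, it's worth confirming which binaries are actually loaded before re-running the repro; the expected values below are assumptions based on the cu115 wheel:

import torch
print(torch.__version__)               # expected: 1.11.0+cu115
print(torch.version.cuda)              # expected: 11.5
print(torch.backends.cudnn.version())  # should be newer than the 8200 reported above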

Confirmed it is fixed with the cu115 build. Thanks!