[solved] Model.cuda() raises Error: torch.backends.cudnn.CuDNNError: 4: b'CUDNN_STATUS_INTERNAL_ERROR'

test_arch.py:

import torch
import torch.nn as nn

class net_lstm(nn.Module):
    def __init__(self):
        super(net_lstm, self).__init__()

        self.lstm = nn.LSTM(10, 10)

    def forward(self, input):
        output, _ = self.lstm(input)
        return output


class net_fc(nn.Module):
    def __init__(self):
        super(net_fc, self).__init__()

        self.fc = nn.Linear(10, 10)

    def forward(self, input):
        return self.fc(input)

The following code for Linear layer works perfectly:

>>> import test_arch
>>> model_fc = test_arch.net_fc()
>>> model_fc
net_fc(
  (fc): Linear(in_features=10, out_features=10, bias=True)
)
>>> model_fc.cuda()
net_fc(
  (fc): Linear(in_features=10, out_features=10, bias=True)
)

But following for LSTM layer raises error:

>>> model_lstm = test_arch.net_lstm()
>>> model_lstm
net_lstm(
  (lstm): LSTM(10, 10)
)
>>> model_lstm.cuda()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in _apply
    self.flatten_parameters()
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 85, in flatten_parameters
    handle = cudnn.get_handle()
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 296, in get_handle
    handle = CuDNNHandle()
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 110, in __init__
    check_error(lib.cudnnCreate(ctypes.byref(ptr)))
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 283, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 4: b'CUDNN_STATUS_INTERNAL_ERROR'
>>> 
Exception ignored in: <bound method CuDNNHandle.__del__ of <torch.backends.cudnn.CuDNNHandle object at 0x7f0a2230fba8>>
Traceback (most recent call last):
  File "/home/cse/ug/14074017/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 114, in __del__
    check_error(lib.cudnnDestroy(self))
ctypes.ArgumentError: argument 1: <class 'TypeError'>: Don't know how to convert parameter 1
KeyboardInterrupt

I tried updating to the latest PyTorch version (0.3.1.post2), deleting ~/.nv, and running with CUDA_CACHE_PATH='/home/your-unixname/cudacache' python main.py, but none of it helped.
Please help.

Ok, I solved it. The GPU was running out of memory; it probably didn't even have enough memory left to report the real error:

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCStorage.cu:58

After freeing up the GPU's memory, the code ran fine.
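For anyone hitting the same thing: checking free GPU memory before calling .cuda() makes the diagnosis much quicker. A minimal sketch that shells out to nvidia-smi (the helper name gpu_free_memory_mib is my own; on PyTorch 0.4+ you could also use torch.cuda.memory_allocated() instead):

```python
import subprocess

def gpu_free_memory_mib():
    """Return a list with the free memory (in MiB) of each GPU,
    or None if nvidia-smi is not available on this machine."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"])
    except (OSError, subprocess.CalledProcessError):
        # No NVIDIA driver / nvidia-smi on PATH, or the query failed.
        return None
    # One line per GPU, each containing just the MiB value.
    return [int(line) for line in out.decode().split()]

if __name__ == "__main__":
    free = gpu_free_memory_mib()
    if free is None:
        print("nvidia-smi not available")
    else:
        print("free MiB per GPU:", free)
```

If the reported free memory is near zero, kill the offending processes (their PIDs show up in plain `nvidia-smi` output) before calling model.cuda() again.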