cuDNN error while using nn.Embedding and LSTM

averma · October 3, 2018, 6:09am

I am trying to use the image captioning code from the PyTorch tutorial. I have changed the vocab size and the captions to match.
Now, at the Embedding(captions) step, I am getting the following error:

(Pdb) embeddings[1]
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCTensorCopy.c line=70 error=59 : device-side assert triggered
*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/THC/generic/THCTensorCopy.c:70

When I step through the code, the line embeddings = embed(captions) executes and I can even see the size of the result, but when I try to print its values or execute the next line (i.e. embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)), I get the error above.
I have no idea what the problem is. Thanks in advance for any help.

averma · October 3, 2018, 6:14am

Running the code with CUDA_LAUNCH_BLOCKING=1 gives the following:

Namespace(batch_size=2, caption_path='/home/ashishverma/Documents/Codes/LSTM_eye_traj/data/train/', crop_size=224, embed_size=256, hidden_size=512, learning_rate=0.001, log_step=10, model_path='models/', num_epochs=5, num_layers=1, num_workers=2, save_step=1000, vocab_length=2552)
torch.Size([24, 256])
Traceback (most recent call last):
  File "lstm_train.py", line 105, in <module>
    main(args)
  File "lstm_train.py", line 62, in main
    outputs = decoder(features, captions, lengths)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ashishverma/Documents/Codes/LSTM_eye_traj/lstm_model.py", line 44, in forward
    hiddens, _ = self.lstm(packed)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 192, in forward
    output, hidden = func(input, self.all_weights, hx, batch_sizes)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 323, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 273, in forward
    handle = cudnn.get_handle()
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 358, in get_handle
    handle = CuDNNHandle()
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 172, in __init__
    check_error(lib.cudnnCreate(ctypes.byref(ptr)))
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 345, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 2: b'CUDNN_STATUS_ALLOC_FAILED'
Exception ignored in: <bound method CuDNNHandle.__del__ of <torch.backends.cudnn.CuDNNHandle object at 0x7fb5f8bb84e0>>
Traceback (most recent call last):
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 176, in __del__
    check_error(lib.cudnnDestroy(self))
ctypes.ArgumentError: argument 1: <class 'TypeError'>: Don't know how to convert parameter 1
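
For anyone reproducing this: CUDA kernels are launched asynchronously, so a device-side assert is often reported at a later line than the one that triggered it. CUDA_LAUNCH_BLOCKING=1 makes the launches synchronous. A minimal sketch of setting it from inside the script, assuming this runs before anything initializes CUDA (placing it above the torch import is the safe choice):

```python
import os

# Must be set before CUDA is initialized; doing it before `import torch`
# is the simplest way to guarantee that.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```

From a shell, the usual alternative is to prefix the command, e.g. CUDA_LAUNCH_BLOCKING=1 python lstm_train.py.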

ptrblck · October 3, 2018, 6:22am

Is your code running fine on the CPU?
If so, could you try to disable cuDNN with torch.backends.cudnn.enabled = False and run it again?
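
A minimal sketch of both suggestions, assuming a toy embedding layer rather than the decoder from this thread (the sizes and index values below are made up for illustration). On the CPU, an out-of-range index raises a readable Python exception instead of a device-side assert:

```python
import torch
import torch.nn as nn

# Turn cuDNN off so RNN layers fall back to the native kernels.
torch.backends.cudnn.enabled = False

# Keep the layer and its input on the CPU (the default device).
embed = nn.Embedding(num_embeddings=5, embedding_dim=3)  # valid indices: 0..4
captions = torch.tensor([[0, 2, 5]])  # 5 is out of range on purpose

try:
    embed(captions)
    error_message = None
except (IndexError, RuntimeError) as exc:  # exception type varies by version
    error_message = str(exc)

print(error_message)
```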

averma · October 3, 2018, 6:27am

Running the code on the CPU gives this error:

Namespace(batch_size=2, caption_path='/home/ashishverma/Documents/Codes/LSTM_eye_traj/data/train/', crop_size=224, embed_size=256, hidden_size=512, learning_rate=0.001, log_step=10, model_path='models/', num_epochs=5, num_layers=1, num_workers=2, save_step=1000, vocab_length=2552)
Traceback (most recent call last):
  File "lstm_train.py", line 105, in <module>
    main(args)
  File "lstm_train.py", line 62, in main
    outputs = decoder(features, captions, lengths)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ashishverma/Documents/Codes/LSTM_eye_traj/lstm_model.py", line 39, in forward
    embeddings = self.embed(captions)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 108, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ashishverma/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1076, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /opt/conda/conda-bld/pytorch_1524590031827/work/aten/src/TH/generic/THTensorMath.c:343

averma · October 3, 2018, 6:46am

I solved it. The error was in my code and was related to the number of classes: I had set the vocab size manually and did not count index 0 as one of the classes.
Thanks @ptrblck for your quick replies.
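
For later readers, a rough sketch of the kind of check that catches this early. An nn.Embedding with num_embeddings rows only accepts indices from 0 to num_embeddings - 1, so the vocab size has to be the largest index plus one. The helper names and the toy captions below are hypothetical, not taken from the original script:

```python
def required_vocab_size(captions):
    """Smallest embedding table size that covers every index in `captions`."""
    largest = max(idx for caption in captions for idx in caption)
    return largest + 1  # index 0 is a class too


def out_of_range_indices(captions, vocab_size):
    """Sorted indices that an embedding with `vocab_size` rows cannot look up."""
    return sorted({idx for caption in captions for idx in caption
                   if not 0 <= idx < vocab_size})


# Hypothetical toy captions: word indices 0..4, so five classes in total.
captions = [[0, 3, 1], [2, 4]]

print(required_vocab_size(captions))      # 5
print(out_of_range_indices(captions, 4))  # [4]  (size set one too small)
print(out_of_range_indices(captions, 5))  # []
```

With tensors, the equivalent check would be something like captions.max().item() < embed.num_embeddings before calling the layer.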