nn.Embedding messes up CUDA tensors

This is the model I am using:

import pdb

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class DecoderRNN(nn.Module):

    def __init__(self, embed_size, vocab_size, hidden_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(2048, embed_size)
        self.bn = nn.BatchNorm1d(embed_size, momentum=0.01)
        self.gru = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)

    def forward(self, features, captions, lengths):
        pdb.set_trace()
        features = self.linear(features)     # (1, 2048) -> (1, embed_size)
        embeddings = self.embed(captions)    # (1, S) -> (1, S, embed_size)
        pdb.set_trace()
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        hiddens, _ = self.gru(packed)
        outputs = self.linear(hiddens[0])    # reuses the 2048 -> embed_size layer
        return outputs

However, after the forward pass through the embedding, every tensor on CUDA is corrupted; tensors that are not CUDA tensors seem to be fine. I get the following error. The pdb session below should clarify 🙂

(Pdb) type(features.data)
<class 'torch.cuda.FloatTensor'>
(Pdb) features.data

2.9174 1.9323 0.8640 … 0.1553 0.9829 0.8675
[torch.cuda.FloatTensor of size 1x2048 (GPU 0)]

(Pdb) aa = self.embed(captions)
(Pdb) aa.data
THCudaCheck FAIL file=/py/conda-bld/pytorch_1493669264383/work/torch/lib/THC/generic/THCTensorCopy.c line=65 error=59 : device-side assert triggered
*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /py/conda-bld/pytorch_1493669264383/work/torch/lib/THC/generic/THCTensorCopy.c:65
(Pdb) aa.data.contiguous()
*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /py/conda-bld/pytorch_1493669264383/work/torch/lib/THC/generic/THCTensorCopy.c:65
(Pdb) features
*** RuntimeError: cuda runtime error (59) : device-side assert triggered at /py/conda-bld/pytorch_1493669264383/work/torch/lib/THC/generic/THCTensorCopy.c:65
(Pdb) lengths

5332
[torch.LongTensor of size 1]

At first glance, it looks like your shapes aren’t right.

features is (1, D), where D = 2048. embeddings, produced by the embeddings = self.embed(captions) line, appears to be (1, S, E), where S is the sequence length, presumably 5332.

So it would seem to me that the cat only works if E = D, the packing only works if S = lengths - 1, and the second self.linear call requires hidden_size = D.

Given the stack trace, it seems to me that the problem is with S, or lengths.
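A minimal sketch of those constraints with dummy tensors (the sizes here are assumptions pulled from your pdb output, and features is taken after the self.linear projection):

import torch
from torch.nn.utils.rnn import pack_padded_sequence

E, S = 256, 5331                      # assumed sizes; lengths would then be S + 1
features = torch.randn(1, E)          # features after self.linear: (1, E)
embeddings = torch.randn(1, S, E)     # self.embed(captions): (1, S, E)

# cat along dim 1 requires the trailing dims to match
seq = torch.cat((features.unsqueeze(1), embeddings), 1)   # (1, S + 1, E)

# packing requires lengths[0] == S + 1
packed = pack_padded_sequence(seq, [S + 1], batch_first=True)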

Thanks for the reply.
The sizes do seem to be correct, though. embed_size is 256, and the vocabulary length is 256 (taking values from 0 to 255), so self.embed should map each index to a 256-dimensional vector. The output of linear is [1, 256] and that of the embedding should be [1, 5332, 256]. In the concat I call features.unsqueeze(1), which introduces one more dimension, so the dimensions should not be a problem.
On further inspection, I realize that when I try to print the embedding right after that statement executes, I get the same error. So somehow the embedding is getting corrupted, even though the statement itself appears to execute correctly.

Interesting. I wonder if you accidentally have values beyond 255 in your captions. That usually gives a nastier-looking error, but it could explain what you’re seeing.
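For reference, a minimal sketch of that failure mode (hypothetical values; on CPU an out-of-range lookup raises a clear IndexError immediately, while on CUDA the kernel asserts asynchronously, which is why the error surfaces at a later, unrelated statement):

import torch
import torch.nn as nn

embed = nn.Embedding(256, 256)             # valid indices are 0..255
bad = torch.LongTensor([[10, 300, 42]])    # 300 is out of range

try:
    embed(bad)                             # CPU: fails right here
except IndexError as e:
    print(e)                               # exact message varies by version
# embed.cuda()(bad.cuda())                 # CUDA: device-side assert, reported later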

You might try running this with the environment variable CUDA_LAUNCH_BLOCKING=1. That will usually make the stack trace point to the line where things are actually going wrong.
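For example (the variable has to be in the environment before the first CUDA call, so setting it from inside Python also works if done early enough):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # or: CUDA_LAUNCH_BLOCKING=1 python script.py

import torch                               # with blocking launches every kernel runs
                                           # synchronously, so the traceback points at
                                           # the statement that actually failed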

Thanks so much.
I was able to debug it. The captions did indeed have entries greater than 255.
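For anyone who hits the same thing: a cheap host-side check in front of the embedding lookup turns the delayed device-side assert into an immediate, readable error. This is just a sketch; safe_embed is a hypothetical helper, not a PyTorch API.

import torch
import torch.nn as nn

def safe_embed(embed, indices):
    # Hypothetical guard: validate on the host before launching the CUDA kernel,
    # so a bad index fails loudly here instead of as a device-side assert later.
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= embed.num_embeddings:
        raise ValueError("indices must be in [0, %d), got [%d, %d]"
                         % (embed.num_embeddings, lo, hi))
    return embed(indices)

embed = nn.Embedding(256, 256)
safe_embed(embed, torch.LongTensor([[1, 2, 255]]))   # fine
# safe_embed(embed, torch.LongTensor([[300]]))       # raises ValueError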