Hi,
I was working on a sequence-to-sequence RNN with variable output size. My particular application domain does not require the output length to exactly match the target sequence, so I decided to stop computing the loss once the EOS token is reached. However, since I am working with batches, I still have to compute the loss for the sequences that have not yet produced the EOS token. Therefore, I use a boolean vector (valid) to track which sequences in the batch still contribute to the loss (see the code below).
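For context, the loop is seeded roughly as follows (a simplified sketch; sos_index and encoder_hidden are placeholder names for values from my setup):

# Simplified initialization of the decoding loop (names are placeholders).
decoder_input = torch.full((batch_size,), sos_index, dtype=torch.long, device=y.device)  # batch of SOS tokens
decoder_hidden = encoder_hidden  # encoder's final hidden state, shape [num_layers, batch_size, hidden_dim]
loss = 0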
valid = torch.full((batch_size,), True, dtype=torch.bool, device=y.device)  # True while a sequence still contributes to the loss
for idx in range(sequence_length):
    decoder_output, decoder_hidden = self.decode(decoder_input, decoder_hidden)
    # decoder_output.shape = [batch_size, n_classes]
    # y.shape = [batch_size, sequence_length]
    # loss += criterion(decoder_output, y[:, idx])  # This works fine
    loss += criterion(decoder_output[valid], y[valid, idx])  # This causes the error in question
    # Greedily sample an output token and use it as the next input.
    topv, topi = decoder_output.topk(1)
    decoder_input = topi.squeeze(1).detach()  # squeeze(1) keeps the batch dimension even for batch_size == 1
    # Once a sequence has emitted EOS, it stays excluded from the loss.
    valid &= decoder_input != eos_index
    if not torch.any(valid):
        break
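For what it is worth, an alternative formulation that avoids the boolean advanced indexing altogether would be to compute the unreduced loss and mask it (a sketch, assuming the criterion is constructed with reduction='none' instead of the default 'mean'):

# Sketch: per-element loss masked by `valid`, no advanced indexing needed.
# Assumes criterion = nn.NLLLoss(reduction='none').
step_loss = criterion(decoder_output, y[:, idx])  # shape: [batch_size]
valid_f = valid.float()
loss += (step_loss * valid_f).sum() / valid_f.sum()  # mean over still-valid sequences

I mention it only because it sidesteps the indexing kernel in the failing line; I would still like to understand why the masked indexing fails.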
Here the criterion type is <class 'torch.nn.modules.loss.NLLLoss'>, and decode is implemented as follows:
def decode(self, x, hidden):
    assert len(x.shape) == 1, "Only handling one input at a time"
    embed = self.embedding_fun(x)
    embed = embed.unsqueeze(0)  # add the sequence dimension expected by the GRU
    rnn_input = nn.functional.relu(embed)
    # rnn_input shape: [1, batch_size, embedding_dim]
    output, hidden = self.gru(rnn_input, hidden)
    output = self.log_softmax(self.out(output[0]))  # log_softmax over dimension 1
    return output, hidden
The code above runs fine for some epochs until the following error ends the process (traceback obtained with CUDA_LAUNCH_BLOCKING=1):
line 36, in train_epoch
loss.backward()
File "[...]/envs/CIL-TC/lib/python3.8/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "[...]/envs/CIL-TC/lib/python3.8/site-packages/torch/autograd/__init__.py", line 97, in backward
Variable._execution_engine.run_backward(
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel() INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1579022027550/work/aten/src/ATen/native/cuda/Indexing.cu:218, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor10361850
The error does not seem to depend on the number of valid elements in the mask, and it does not appear to be an out-of-memory error in disguise.
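In case it helps to narrow things down, a minimal consistency check right before the failing line would look like this (a sketch; these asserts are not in my actual training code):

# Hypothetical sanity checks before the masked indexing (not in my code).
assert valid.dtype == torch.bool
assert valid.shape[0] == decoder_output.shape[0] == y.shape[0]
assert y[:, idx].min() >= 0
assert y[:, idx].max() < decoder_output.shape[1]

The last two lines rule out out-of-range target indices for NLLLoss, which on the GPU can otherwise surface as cryptic device-side asserts.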
My question is as follows: am I doing something wrong, or is this internal assert masking a more helpful error message?