Loss functions, masking and backprop for variable-length sequences

Whilst I do sort of understand how things like pack_padded_sequence work now, I’m still not entirely sure how padding for variable-length sequences should look in the grand scheme of things. I have a bunch of variable-length sentences that pass through (oversimplifying a wee bit here) - a) an Embedding layer, b) a biLSTM, c) a Linear layer.

What I’m doing right now is something like:

  • pad all sentences with 0s, to the length of the longest sentence (S) in the minibatch - [B x S]

  • get embeds - [B x S x E]

  • run pack_padded_sequence(embeds_out, sentence_sizes) through the LSTM

  • apply a dense layer followed by ReLU to pad_packed_sequence(lstm_output)[0]; the output should be [B x S x S]

  • calculate NLLLoss(reduce=False) after LogSoftmax; multiply element-wise by a masking tensor; sum() across both dimensions and average over the number of unmasked elements (a rough sketch of the whole pipeline follows this list)
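
Roughly, in code, this is what I mean (a rough sketch rather than my exact model; PAD_IDX, the sizes and NUM_CLASSES below are placeholders, and I’ve written the output as a generic per-token classification over C classes instead of my [B x S x S] case):

    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

    PAD_IDX, VOCAB, EMB, HID, NUM_CLASSES = 0, 1000, 50, 64, 20  # placeholder sizes

    class Tagger(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, EMB, padding_idx=PAD_IDX)
            self.lstm = nn.LSTM(EMB, HID, batch_first=True, bidirectional=True)
            self.dense = nn.Linear(2 * HID, NUM_CLASSES)

        def forward(self, sents, lengths):
            # sents: [B x S], padded with PAD_IDX; lengths: true sentence sizes
            # (depending on the PyTorch version, lengths may need to be sorted descending)
            embeds = self.embed(sents)                                       # [B x S x E]
            packed = pack_padded_sequence(embeds, lengths, batch_first=True)
            lstm_out, _ = self.lstm(packed)
            unpacked, _ = pad_packed_sequence(lstm_out, batch_first=True)    # [B x S x 2H]
            scores = torch.relu(self.dense(unpacked))                        # [B x S x C]
            return torch.log_softmax(scores, dim=-1)

    def masked_nll(log_probs, targets, mask):
        # log_probs: [B x S x C], targets: [B x S], mask: [B x S], 1.0 for real tokens
        B, S, C = log_probs.shape
        per_token = nn.NLLLoss(reduction='none')(            # reduce=False on older versions
            log_probs.view(-1, C), targets.view(-1)).view(B, S)
        # padded target positions can hold any valid class index; the mask zeroes their loss
        return (per_token * mask).sum() / mask.sum()          # average over unmasked elements

with mask = (sents != PAD_IDX).float() built once per minibatch.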

I’m not entirely sure whether the way I’m calculating the loss is legit - can I multiply the loss by the mask outside the criterion call, like so:

    loss = (self.criterion(y_pred, y) * mask).sum()

or do I need to specifically subclass NLLLoss and roll my own loss function that lets me do this? How does backprop work for something as seemingly non-differentiable as masking?
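
To make the question concrete, here is a toy version of what I mean (the sizes and the mask values are made up):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    logits = torch.randn(2, 4, 5, requires_grad=True)         # [B x S x C]
    targets = torch.randint(0, 5, (2, 4))                      # [B x S]
    mask = torch.tensor([[1., 1., 1., 0.],                     # first sentence: 3 real tokens
                         [1., 1., 0., 0.]])                    # second sentence: 2 real tokens

    log_probs = torch.log_softmax(logits, dim=-1)
    per_token = nn.NLLLoss(reduction='none')(log_probs.view(-1, 5),
                                             targets.view(-1)).view(2, 4)
    loss = (per_token * mask).sum() / mask.sum()               # mask applied outside the criterion
    loss.backward()                                            # is this doing the right thing?

    print(logits.grad[0, 3].abs().sum(), logits.grad[1, 2:].abs().sum())  # masked positions

If the mask trick is legit, I’d expect logits.grad to come out as exactly zero at the masked positions printed above, since the mask is just a constant factor on the per-token losses.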

I’m not quite sure what your question is. Why do you have to use a mask? There is an ignore_index argument in NLLLoss. Does this help you?
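
For reference, the usual ignore_index pattern looks something like this (PAD_IDX and the class count here are assumptions, and the log-probabilities are already flattened to [N x C]):

    import torch
    import torch.nn as nn

    PAD_IDX, C = 0, 5                                          # assumed padding index / class count
    log_probs = torch.log_softmax(torch.randn(6, C), dim=-1)   # flattened to [N x C]
    targets = torch.tensor([3, 1, PAD_IDX, 2, PAD_IDX, 4])     # flattened to [N]

    criterion = nn.NLLLoss(ignore_index=PAD_IDX)
    # target positions equal to PAD_IDX are skipped and excluded from the mean
    loss = criterion(log_probs, targets)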

From what I gather, ignore_index works for the entire minibatch and not per sample.

I’m trying to mask things because I don’t want my network to learn that it can just say ‘pad’ everywhere and get a lower loss for shorter sentences.