Whilst I do sort of understand how things like pack_padded_sequence work now, I’m still not entirely sure how padding for variable-length sequences should look in the grand scheme of things. I have a bunch of variable-length sentences that pass through (oversimplifying a wee bit here): a) an Embedding layer, b) a biLSTM, c) a Linear layer.
What I’m doing right now is something like this (rough code sketch after the list):
- pad all sentences with 0s, to the length of the longest sentence (S) in the minibatch → [B x S]
- get embeds → [B x S x E]
- pack_padded_sequence(embeds_out, sentence_sizes) → LSTM
- ReLU on a dense layer on pad_packed_sequence(lstm_output)[0]; output should be [B x S x S]
- calculate NLLLoss(reduce=False) after LogSoftmax; pairwise multiplication w/ a masking tensor; sum() across two dimensions and average over the number of unmasked elements.
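To make the shapes concrete, here’s roughly what my forward pass looks like. This is a simplified sketch, not my actual code: vocab_size, emb_dim, hidden_dim, out_dim and the class name are placeholders, and I’m assuming batch_first tensors with sentences already sorted longest-first for packing.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class SketchModel(nn.Module):
    # vocab_size, emb_dim, hidden_dim, out_dim are placeholder hyperparameters
    def __init__(self, vocab_size, emb_dim, hidden_dim, out_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.dense = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, sentences, sentence_sizes):
        # sentences: [B x S], zero-padded to the longest sentence in the minibatch;
        # assumed sorted by length, longest first
        embeds_out = self.embedding(sentences)                            # [B x S x E]
        packed = pack_padded_sequence(embeds_out, sentence_sizes, batch_first=True)
        lstm_output, _ = self.lstm(packed)
        unpacked, _ = pad_packed_sequence(lstm_output, batch_first=True)  # [B x S x 2H]
        scores = F.relu(self.dense(unpacked))                             # [B x S x out_dim]
        return F.log_softmax(scores, dim=2)                               # fed to NLLLoss
```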
I’m not entirely sure whether the way I’m calculating the loss is legit - can I just multiply the per-element loss by the mask outside the criterion call, like so:
loss = (self.criterion(y_pred, y) * mask).sum().sum()
or do I need to specifically subclass NLLLoss and roll my own loss function that lets me do this? How does backprop work for something as seemingly non-differentiable as masking?
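For reference, here’s roughly what I mean by the masked loss, with made-up tensor names: log_probs is the LogSoftmax output from the model, y the padded targets, and mask a 0/1 float tensor that is 0 at padded positions.

```python
import torch.nn as nn

criterion = nn.NLLLoss(reduce=False)  # reduction='none' in newer PyTorch versions

def masked_loss(log_probs, y, mask):
    # NLLLoss wants the class dimension in position 1, so go from
    # [B x S x C] to [B x C x S]; the result is a per-token loss of shape [B x S]
    per_token = criterion(log_probs.transpose(1, 2), y)
    # zero out the loss at padded positions, then average over unmasked elements
    return (per_token * mask).sum() / mask.sum()
```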