Whilst I do sort of understand how things like `pack_padded_sequence` work now, I’m still not entirely sure how padding for variable-length sequences should look in the grand scheme of things. I have a bunch of variable-length sentences that pass through (oversimplifying a wee bit here): a) an `Embedding` layer, b) a biLSTM, c) a `Linear` layer.
What I’m doing right now is something like the following (rough code sketch after the list):
- pad all sentences with 0s to the length of the longest sentence (S) in the minibatch - `[B x S]`
- get embeds - `[B x S x E]`
- `pack_padded_sequence(embeds_out, sentence_sizes)`, then the LSTM, then ReLU on a dense layer applied to `pad_packed_sequence(lstm_output)[0]`; output should be `[B x S x S]`
- calculate `NLLLoss(reduce=False)` after `LogSoftmax`; pairwise multiplication with the masking tensor; `sum()` across two dimensions and average over the number of unmasked elements.
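In code, the forward pass looks roughly like this (a minimal sketch; the class name, sizes, `padding_idx`, etc. are illustrative rather than my exact code):

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class SentenceModel(nn.Module):
    # Illustrative names and sizes only, not my exact model.
    def __init__(self, vocab_size, embed_dim, hidden_dim, out_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, sentences, sentence_sizes):
        # sentences: [B x S], 0-padded to the longest sentence in the minibatch
        # sentence_sizes: true lengths, sorted in decreasing order
        embeds = self.embedding(sentences)                    # [B x S x E]
        packed = pack_padded_sequence(embeds, sentence_sizes, batch_first=True)
        lstm_out, _ = self.lstm(packed)
        unpacked, _ = pad_packed_sequence(lstm_out, batch_first=True)  # [B x S x 2H]
        return F.relu(self.dense(unpacked))                   # [B x S x out_dim]
```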
I’m not entirely sure whether the way I’m calculating the loss is legit - can I just multiply the loss by the mask outside the loss function, like so:

`loss = (self.criterion(y_pred, y) * mask).sum().sum()`

or do I need to subclass `NLLLoss` and roll my own loss function that lets me do this? How does backprop work for something as seemingly non-differentiable as masking?
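To make the question concrete, the masked-loss computation I have in mind is roughly this (again just a sketch; variable names and the helper are made up, and `reduce=False` is `reduction='none'` on newer PyTorch):

```python
import torch.nn as nn

# Sketch only.
# y_pred: [B x S x C] dense-layer output (C = S in my case),
# y:      [B x S] target indices,
# mask:   [B x S] with 1 for real tokens and 0 for padding.
log_softmax = nn.LogSoftmax(dim=2)
criterion = nn.NLLLoss(reduce=False)  # per-element losses, no reduction

def masked_nll(y_pred, y, mask):
    B, S, C = y_pred.size()
    log_probs = log_softmax(y_pred)                                 # [B x S x C]
    # NLLLoss expects [N x C] scores and [N] targets, so flatten B and S
    per_token = criterion(log_probs.view(B * S, C), y.view(B * S))  # [B*S]
    per_token = per_token.view(B, S)
    masked = per_token * mask.float()         # zero out the padded positions
    return masked.sum() / mask.float().sum()  # average over unmasked elements only
```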