I’m doing a simple seq2seq encoder-decoder model on batched sequences of varied lengths, and I’ve got the encoder working with pack_padded_sequence/pad_packed_sequence.
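(For context, my encoder setup is roughly the following; the dimensions and names here are just illustrative dummies, not my real model:)

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch, max_len, emb_dim, hid_dim = 3, 5, 8, 16   # dummy sizes
x = torch.randn(batch, max_len, emb_dim)         # padded, already-embedded batch
xlen = torch.tensor([5, 3, 2])                   # true length of each sequence

encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
packed = pack_padded_sequence(x, xlen, batch_first=True, enforce_sorted=False)
packed_out, h_n = encoder(packed)
# enc_out comes back as (batch, max_len, hid_dim), zero-padded past each true length
enc_out, _ = pad_packed_sequence(packed_out, batch_first=True)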
Now, after decoding a batch of varied-length sequences, I’d like to accumulate loss only on words in my original sequences (i.e., not on <PAD> elements).
Originally, I was accumulating loss on the entire batch like so:
import torch.nn as nn

loss_function = nn.NLLLoss()
loss = 0
for word in range(max_seq_len_in_batch - 1):
    # predict token word+1 from the log-probs at step `word`, averaged over the batch
    loss += loss_function(output[:, word, :], y_data[:, word + 1])
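(Shape assumptions, in case it matters: output holds log-probabilities of shape (batch, max_seq_len_in_batch, vocab_size), and y_data holds token indices of shape (batch, max_seq_len_in_batch). A self-contained dummy setup that the loop above runs on:)

import torch

batch, max_seq_len_in_batch, vocab = 4, 6, 10
output = torch.log_softmax(torch.randn(batch, max_seq_len_in_batch, vocab), dim=-1)
y_data = torch.randint(vocab, (batch, max_seq_len_in_batch))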
However, that doesn’t take the variable lengths of the decoded sequences into account: it also accumulates loss on <PAD> elements. So I changed the loop to:
loss = 0
# -- for each sequence in the batch
for idx, seq_len in enumerate(ylen):
    # -- for each word in the real (unpadded) part of the sequence
    for word in range(seq_len - 1):
        # unsqueeze both so NLLLoss sees input (1, vocab) and target (1,)
        loss += loss_function(output[idx, word, :].unsqueeze(0),
                              y_data[idx, word + 1].unsqueeze(0))
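For reference, the fully batched alternative I’ve been considering (reusing output and y_data from above, and assuming <PAD> maps to some known vocabulary index PAD_IDX, a name I’m making up here) is to flatten everything and let NLLLoss skip the pads via ignore_index:

import torch.nn as nn

PAD_IDX = 0  # assumption: whatever index <PAD> actually has in my vocab

loss_function = nn.NLLLoss(ignore_index=PAD_IDX)
# align step `word` with target `word + 1`, then flatten to (batch * steps, vocab)
# and (batch * steps,); targets equal to PAD_IDX contribute nothing to the loss
loss = loss_function(output[:, :-1, :].reshape(-1, output.size(-1)),
                     y_data[:, 1:].reshape(-1))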
Granted, that gives a different loss value than the previous whole-batch calculation. But it’s drastically different, by orders of magnitude, and I’m wondering whether that will affect backpropagation.
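If I understand the default reduction="mean" correctly, each call in my first loop averages over the batch, while each call in the nested loop scores a single element, so even with no padding at all the nested sum should come out about batch_size times larger. A quick dummy-data sanity check of that:

import torch
import torch.nn as nn

torch.manual_seed(0)
batch, steps, vocab = 4, 5, 10
output = torch.log_softmax(torch.randn(batch, steps + 1, vocab), dim=-1)
y_data = torch.randint(vocab, (batch, steps + 1))

nll = nn.NLLLoss()
batched = sum(nll(output[:, w, :], y_data[:, w + 1]) for w in range(steps))
per_elem = sum(nll(output[i, w, :].unsqueeze(0), y_data[i, w + 1].unsqueeze(0))
               for i in range(batch) for w in range(steps))
print((per_elem / batched).item())  # -> 4.0, i.e. exactly batch_size here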
Question 1: Am I doing the above correctly? Should I normalize somehow?
Question 2: What is the correct way to accumulate loss on sequences with variable lengths in batches?