Calculating loss on sequences with variable lengths

shavitamit · November 13, 2017, 8:21pm

I’m doing a simple seq2seq encoder-decoder model on batched sequences with varied lengths, and I’ve got it working with the pack_padded_sequence and pad_packed_sequence for the encoder.

Now, after decoding a batch of varied-length sequences, I’d like to accumulate loss only on words in my original sequence (i.e., not on <PAD>s)

Originally, I was accumulating loss on the entire batch like so:

loss_function = nn.NLLLoss()
loss = 0
for word in range(max_seq_len_in_batch - 1):
    loss += loss_function(output[:,word,:], y_data[:,word+1])

However, that doesn’t take into account variable length decoded sequences. I don’t want to accumulate loss on <PAD> elements. I then changed the loop to:

# -- for each sequence in the batch
for idx, seq_len in enumerate(ylen):
    
    # -- for each word in the sequence
    for word in range(seq_len-1):       
        loss += loss_function(output[idx,word,:].unsqueeze(0), y_data[idx,word+1])

Granted, that gives a different loss value than the previous way of calculating it on the entire batch. However, it’s drastically different, by orders of magnitude. I’m wondering if that will affect the backpropagation…

Question 1: Am I doing the above correctly? Should I normalize somehow?
Question 2: What is the correct way to accumulate loss on sequences with variable lengths in batches?

Thanks.

colesbury · November 14, 2017, 7:44am

They’re different because by default NLLLoss averages over the number of observations. See:

http://pytorch.org/docs/master/nn.html#torch.nn.NLLLoss

Set size_average to False and divide the loss by the number of non-padding tokens. That should give you approximately the same value.

It will affect back-propagation in the same way that scaling your learning rate affects it: scaling the loss by X scales the gradients by the same factor X.

shavitamit · November 14, 2017, 12:35pm

Thanks, that helps. One more thing I’ve noticed is that by adding that
second loop, my time per epoch is slowed down by almost four-fold.

Are there ways to make that calculation more efficient?

colesbury · November 14, 2017, 7:05pm

NLLLoss has an ignore_index. Use the batched version, but set ignore_index to your padding value.

shavitamit · November 14, 2017, 7:06pm

Thanks - that solved it. Perfect.

shavitamit · November 14, 2017, 7:08pm

By the way - is there a way to avoid the outer-most loop? (as shown below):

loss_function = nn.NLLLoss(ignore_index=out_word2idx['<PAD>']).cuda()
loss = 0
for word in range(max_seq_len-1):
    loss += loss_function(output[:,word,:], y_data[:,word+1])

I’m wondering if it’s possible to avoid the loop over individual words in the sequence altogether.

colesbury · November 14, 2017, 7:26pm

Shouldn’t this work?

loss_function = nn.NLLLoss(ignore_index=out_word2idx['<PAD>']).cuda()
loss = loss_function(output, y_data)

FWIW, I find it generally cleaner to use the “functional” version, but it’s up to you:

import torch.nn.functional as F
loss = F.nll_loss(output, y_data, ignore_index=out_word2idx['<PAD>'])

shavitamit · November 14, 2017, 7:32pm

I’ll give the “functional” version a try. your first suggestion (loss_function(output,y_data)) does not work because I guess it expects either 2 or 4 dimensions.

One approach to doing it all in one line is actually like so:

(torch.gather(-output[:,:-1,:],2,y_data[:,1:].unsqueeze(2)).squeeze().data * mask.cuda()).sum()

Where mask corresponds to a [batch_sz, max_seq_len-1] tensor of 1s and 0s corresponding to word or <PAD>, respectively. That’s a pretty ugly solution in my opinion, and I haven’t even tried to check whether it’s faster or slower. Plus, it requires the creation of a new mask tensor that needs to be copied to the GPU for each batch evaluation.

I’ll give your “functional” solution a try and hopefully that solves it.

lyan62 · February 18, 2018, 6:37pm

Hi did you work out a solution? I’m having a same problem

shavitamit · February 18, 2018, 8:22pm

Which exact question are you referring to? Not accumulating loss on <PAD> elements? If yes, I did solve that using the ignore_index argument (see above).

aspiringcoder · January 14, 2022, 2:48pm

Hi @shavitamit, I’m facing the same issue.

Target tensor of BxS where S is variable length padded with <PAD> token.

Output of my model is BxV where V is a probability distribution over the vocabulary.

I’ve currently ended up doing exactly what you did originally:

crit = NLLoss(ignore_index='<PAD>')
for token in range(seqlen):
   pred,hidden = model(words[:,token],hidden)
   loss += crit(pred,target[:,token])

I’d love to figure out a smarter way to do it, and I do have a mask, flattening all my tokens without the padding token, but then the dimensions don’t add up, as the loss function expects a two dimensional input.